Abstract

In  numerous  current  and  future  applications  ranging  from  autonomous  navigation  of 
mobile  robots  to  collision  avoidance  systems  for  cars,  an  imaging  system  (installed 
on  a  moving  vehicle)  takes  2D  images  of  an  environment  with  the  aim  of  finding  the 
motion  of  the  vehicle  (translational  and  rotational  velocities)  as  well  as  the  structure 
of  the  environment  (shape).  In  machine  vision,  this  problem  is  referred  to  as  the 
general  motion  vision  problem. 

This thesis introduces a direct method called fixation for solving this general
motion vision problem: arbitrary motion relative to an arbitrary environment. Avoiding
feature correspondence and optical flow has been the motivation behind this direct
method, which uses the spatio-temporal brightness gradients of the images directly.
The fixation method results in a linear constraint equation (the Fixation Constraint
Equation) which explicitly expresses the rotational velocity in terms of the translational
velocity. The combination of this constraint equation with the Brightness-Change
Constraint Equation (a fundamental equation which relates the motion to the
brightness gradients at any image point) solves the general motion vision problem.

In  contrast  to  previous  direct  methods,  the  fixation  method  does  not  impose  any 
severe  restrictions  on  the  motion  or  the  environment.  Moreover,  the  fixation  method 
neither  requires  tracked  images  as  its  input  nor  uses  tracking  for  obtaining  fixated 
images.  Instead,  it  introduces  a  novel  technique  called  the  pixel  shifting  process  to 
construct  fixated  images  for  any  arbitrary  fixation  point.  This  is  done  entirely  in 
software without any need to move the imaging system for tracking.

This fixation method has been successfully tested in a real-world environment
for the recovery of motion and shape in the general case. The experimental results
are presented, and the implementation issues and techniques are discussed.
Introduction

Chapter  1 


One of the principal objects of theoretical research in
any department of knowledge is to find the point of view
from which the subject appears in its greatest simplicity.

-Josiah  Willard  Gibbs 

A  little  thought  about  the  role  of  vision  in  the  tasks  that  humans  perform  in  their 
everyday  life  leaves  no  doubt  about  its  importance.  For  the  past  several  decades, 
physiologists  and  psychophysicists  have  been  striving  to  understand  the  underlying 
mechanisms  of  human  vision.  On  a  parallel  track,  computer  vision  scientists  have 
been  working  on  the  development  of  artificial  systems  for  performing  different  visual 
tasks. 

1.1  Motion  Vision 

In  many  applications,  an  imaging  system  (installed  on  a  moving  vehicle)  takes  2D 
images  of  the  environment.  In  motion  vision,  the  goal  is  to  find  the  motion  of  the 
moving  vehicle  (translational  and  rotational  velocities)  as  well  as  the  shape  (structure 
of  the  environment),  using  a  sequence  of  time  varying  images  such  as  those  shown  in 
fig.  1-1. 

Figure 1-1: A sequence of real images where the motion between two images is a
combination of translation and rotation.

Like many other vision problems, motion vision is extremely hard to accomplish.
The difficulties stem from three major sources:


•  Underconstrained: 

Deriving 3D information (motion and shape) from 2D data (images) is a severely
underconstrained problem (i.e., an infinite number of solutions are potentially
consistent with the given data).

•  Huge  Amount  of  Data: 

Processing even a single regular size image (512 x 512 pixels) requires handling
about a quarter million pixels' worth of data.

•  Noise:  Real  image  data  are  very  noisy. 


1.1.1 Problem statement

The  problem  which  we  have  addressed  in  this  thesis  can  be  summarized  as  follows: 

Finding the motion (relative translational and rotational velocities) and
shape (environment structure) from a sequence of two real images by a
direct method (not using either optical flow or feature correspondence)
in the general case (without restricting the motion or shape).


1.2  Previous  Work  (Main  Approaches) 

People have been working on motion vision problems for several decades using three
major techniques: optical flow, feature correspondence, and direct methods.

A survey of previous literature on machine vision is given in [11] and a partial list
of last year's papers in computer vision is compiled in [51]. Some of the current issues in
image flow theory and motion vision are discussed in [88, 4, 55]. Much of the earlier
work on recovering motion has been based on either establishing correspondences
between the images of prominent features (points, lines, contours, and so on) in an
image sequence, the so-called feature correspondence [48, 80, 81, 35, 3], or establishing
the velocity of points over the whole image, commonly referred to as the optical flow
[8, 14, 2].

Each of the main approaches (optical flow, feature correspondence, and direct
methods) is described briefly in this section, and an example is given for each case
using the real image sequence in fig. 1-1.

1.2.1  Optical  flow 

The  computation  of  the  local  flow  field  exploits  a  constraint  equation  between  the 
local  brightness  changes  and  the  two  components  of  the  optical  flow.  This  only  gives 
the  components  of  flow  in  the  direction  of  the  brightness  gradient.  To  compute  the 
full flow field, one needs additional constraints such as the heuristic assumption that
the flow field is locally smooth [30, 28]. This leads to an estimated optical flow field
which may not be the same as the true motion field.
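The remark that the constraint only fixes the flow component along the brightness gradient can be illustrated with a short sketch. It assumes the usual optical flow form of the constraint, E_x u + E_y v + E_t = 0, and the names `normal_flow` and `eps` are illustrative choices, not taken from the thesis.

```python
import numpy as np

def normal_flow(E_x, E_y, E_t, eps=1e-9):
    """Flow component along the brightness gradient at each pixel.

    Assumes the constraint E_x*u + E_y*v + E_t = 0. Only the projection of
    (u, v) onto the gradient direction is determined by this single equation;
    the component perpendicular to the gradient is left free.
    """
    grad_sq = E_x**2 + E_y**2
    # eps avoids division by zero in textureless regions (hypothetical guard).
    scale = -E_t / np.maximum(grad_sq, eps)
    return scale * E_x, scale * E_y

# True flow (u, v) = (1.0, 0.5) seen through a purely horizontal gradient:
E_x, E_y = np.array([2.0]), np.array([0.0])
E_t = -(E_x * 1.0 + E_y * 0.5)
u_n, v_n = normal_flow(E_x, E_y, E_t)   # v = 0.5 cannot be recovered here
```

In this example the horizontal component of the true flow is recovered while the vertical component is lost, which is why an extra assumption such as local smoothness is needed to obtain a full flow field.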

Figure  1-2  shows  an  optical  flow  field  for  the  image  sequence  given  in  fig.  1-1. 
The  size  and  direction  of  the  apparent  velocity  at  any  pixel  is  shown  by  an  arrow. 
Instead  of  the  original  images,  such  optical  flow  fields  are  used  as  a  primary  source 
of  information  in  the  optical  flow  techniques. 

The irregular optical flows on the upper edge of this map are probably due to the
noise and inherent errors involved in the computations at the image borders.


Figure  1-2:  The  optical  flow  map  for  the  given  real  image  sequence.  The  arrows  show 
the  magnitude  and  direction  of  the  apparent  motion  at  each  point. 

1.2.2  Feature  correspondence 

In  general,  identifying  features  here  means  determining  gray-level  corners.  For  images 
of  smooth  objects,  it  is  difficult  to  find  good  features  or  corners.  Furthermore,  the 
correspondence  problem  has  to  be  solved,  that  is,  feature  points  from  consecutive 
frames  have  to  be  matched. 

Figure  1-3  shows  the  edge  map  for  the  top  image  in  fig.  1-1.  Several  correspondence 
methods  use  such  edge  maps  as  the  basic  source  of  data  instead  of  the  original  image. 
Then,  they  try  to  find  some  common  features  in  different  edge  maps  and  relate  them 
together. 


Figure 1-3: The edge maps for each of the images in the sequence.

1.2.3  Direct  methods 

The use of optical flow or correspondence techniques for solving motion vision
problems has proven to be rather unreliable and computationally expensive [84, 83, 34].
These techniques spend a lot of effort on transforming the original images to the
optical flow or the edge maps. The assumptions made in these procedures result in
errors and loss of some useful information which exists in the original images.

These  problems  have  motivated  the  investigation  of  direct  methods  which  use  the 
image  brightness  information  directly  to  recover  the  motion  and  shape  without  any 
need  to  preprocess  the  original  image. 

Previous work in direct motion vision has used the Brightness-Change Constraint
Equation (BCCE) for rigid body motion [44]

$$E_t + \mathbf{v} \cdot \boldsymbol{\omega} + \frac{\mathbf{s} \cdot \mathbf{t}}{Z} = 0 \tag{1.1}$$

to solve special cases such as known depth [30], pure translation or known rotation
[31], pure rotation [31], and planar world [44]. Chapter 2 describes the details of this
nonlinear equation, which relates depth Z, translational velocity t, and rotational
velocity ω together.

All  these  direct  methods  are  restricted  in  the  types  of  motion  or  shape  that  they 
can  handle.  Our  aim  is  to  solve  the  motion  vision  problem  in  the  general  case  using  a 
direct  method  but  without  restricting  either  the  motion  or  the  shape  to  any  special 
case. 


1.3  Fixation  Approach 

This  thesis  presents  a  direct  method  called  fixation  for  solving  the  motion  vision 
problem  in  the  general  case  without  placing  any  restrictions  on  the  motion  or  the 
shape  [65,  69,  60].  The  fixation  method  is  based  on  the  theoretical  proof  that  for  a 
sequence  of  fixated  images  (a  sequence  of  images  with  one  stationary  image  point  in 
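them), the 3D rotational velocity ω can always be explicitly expressed in terms of a
linear function of the 3D translational velocity t. Namely,

$$\boldsymbol{\omega} = \omega_{R_0} \hat{\mathbf{R}}_0 + \frac{1}{\|\mathbf{R}_0\|} (\mathbf{t} \times \hat{\mathbf{R}}_0) \tag{1.2}$$

where R̂_0 is the unit vector along the position vector of the fixation point (a point
in the image plane which stays stationary), ‖R_0‖ is the distance to the corresponding
point in the environment, and ω_R0 is the component of rotational
velocity about the fixation axis R̂_0.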

It should be emphasized that we do not need to know the real fixation point, if
there is any, to take advantage of this fixation constraint equation (FCE), eqn. 1.2. In
fact, our algorithm allows us to choose virtually any point as the fixation point and
obtain a sequence of fixated images [65, 69] by a simple software manipulation of one
of the original images.

The combination of the Fixation Constraint Equation (FCE), eqn. 1.2, and the
BCCE, eqn. 1.1, offers a solution to the motion vision problem of arbitrary motion
relative to an arbitrary rigid environment. That is, it allows recovery of the depth
map Z, total 3D rotational velocity ω, and 3D translational velocity t without placing
severe restrictions on the motion or the shape [65, 69].
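To make the combination concrete, the substitution can be written out in one line. This is a sketch rather than the derivation given in chapter 3: it assumes the FCE in the form ω = ω_R0 R̂_0 + (t × R̂_0)/‖R_0‖ and the BCCE in the form E_t + v · ω + (s · t)/Z = 0, and it uses the scalar triple product identity v · (t × R̂_0) = t · (R̂_0 × v).

```latex
\begin{aligned}
0 &= E_t + \mathbf{v}\cdot\Big(\omega_{R_0}\hat{\mathbf{R}}_0
      + \tfrac{1}{\|\mathbf{R}_0\|}\,(\mathbf{t}\times\hat{\mathbf{R}}_0)\Big)
      + \frac{\mathbf{s}\cdot\mathbf{t}}{Z} \\
  &= E_t + \omega_{R_0}\,(\mathbf{v}\cdot\hat{\mathbf{R}}_0)
      + \mathbf{t}\cdot\Big(\frac{\hat{\mathbf{R}}_0\times\mathbf{v}}{\|\mathbf{R}_0\|}
      + \frac{\mathbf{s}}{Z}\Big).
\end{aligned}
```

Under these assumptions the rotation enters only through the single scalar ω_R0, and the remaining unknowns are the translational velocity t and the depth Z at each pixel.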

1.4  Contributions 

The principal contributions of this thesis are summarized as follows.

•  Derivation  of  the  Fixation  Constraint  Equation: 

Deriving a strong constraint equation called the fixation constraint equation (FCE).
This constraint equation has a solid mathematical foundation. It expresses that for a
sequence of fixated images, the rotational velocity can always be explicitly expressed
as a linear function of translational velocity [69, 62, 61]. This equation is general and
no hidden assumptions were made in its derivation.

•  Obtaining  a  solution  to  the  general  motion  problem: 

Introducing  a  direct  method  called  the  fixation  method  which  provides  a  solution  for 
the general motion vision problem and has the following properties [69, 60, 63]:

- Finds the motion (translational and rotational velocities) and shape (the
environment structure) from two monocular images.

- Does not restrict the motion or shape.

-  Does  not  use  either  optical  flow  or  feature  correspondence. 

-  Is  computationally  simple. 

•  Tracking  without  moving  the  camera: 

Presenting a novel method called the pixel shifting process for constructing a sequence
of fixated (tracked) images from any arbitrary image sequence [60, 64]. It allows an
arbitrary choice of fixation point, is fully software based, and does not require moving
the camera for tracking.

•  Autonomous  choice  of  an  optimum  fixation  patch  size: 

Finding  a  technique  for  autonomous  choice  of  an  optimum  fixation  patch  size  which 
results  in  good  estimates  for  the  motion  parameters.  This  technique  is  based  on 
defining  a  norm  called  normalized  error  and  has  been  successfully  implemented  and 
tested on real images [68, 72, 66].

•  Autonomous  choice  of  an  appropriate  fixation  point  location: 

Some regions of a given image are better suited for use as fixation patches. We have
developed  a  method  for  autonomous  choice  of  an  appropriate  fixation  point  location 
[67,  72]. 

•  Rotation  axis  calibration: 

Introducing  a  procedure  for  the  calibration  of  a  rotation  axis  in  imaging  systems.  This 
technique  is  simple  but  useful  and  results  in  avoiding  potential  implementation  errors 
[70,  72]. 

•  Representing  image  gradients: 

A novel method has been presented for visual representation of the spatio-temporal
gradients. These intensity gradient maps allow one to visually understand the
characteristics and significance of the brightness gradients [73, 70].

•  Constructing  fixated  (tracked)  image  sequences: 

Using the pixel shifting process and a bilinear interpolation technique, we have
constructed fixated images from real images [73, 70].

•  Depth  map  recovery  from  two  monocular  real  images: 

We have recovered good depth maps from two monocular real images using the
fixation method [71, 70].

1.5 Thesis Structure

This work comprises three parts: Theory, Implementation, and Appendices.


1.5.1 Part I: Theory

This  part  covers  the  mathematical  background  of  direct  methods  and  the  detailed 
theory  of  fixation. 

•  Chapter  2 

We begin with a description of the camera model and coordinate system used in
this  work.  Then,  the  brightness  change  constraint  equation  (BCCE)  used  by  direct 
methods  is  explained. 

•  Chapter  3 

This  chapter  presents  the  main  idea  behind  our  fixation  method.  It  shows  how  the 
Fixation  Constraint  Equation  (FCE)  is  derived  and  how  it  can  be  combined  with  the 
BCCE in order to solve for the translational velocity t, rotational velocity ω, and the
depth  Z  at  any  image  point. 

•  Chapter  4 

In an arbitrary image sequence, a point chosen as the fixation point does not
necessarily stay stationary in the image plane. This chapter introduces the algorithms for
the estimation of the apparent velocity at the fixation point (fixation velocity) which
are required for the construction of a sequence of fixated images. Simultaneously,
these algorithms find an estimate for the component of the rotational velocity along
the fixation axis, ω_R0, which appears in the FCE.

•  Chapter  5 

The  fixation  method  requires  a  sequence  of  fixated  images.  This  chapter  shows  how 
a  sequence  of  fixated  images  can  be  constructed  from  an  arbitrary  image  sequence 
using  the  components  of  the  fixation  velocity. 

• Chapter 6

This chapter ends the basic theoretical part of the thesis by giving an overview of the
main modules involved in the fixation method.

1.5.2  Part  II:  Implementation 

This part presents the experimental results of applying the algorithms given in Part
I to real image sequences. The implementation issues are described along with
techniques for dealing with some practical problems.

•  Chapter  7 

The  spatio-temporal  brightness  gradients  of  the  images  are  the  primary  source  of 
data  used  in  our  fixation  method.  This  chapter  introduces  a  novel  technique  for 
representing the gradients of real images. Such representations give
better insight into the characteristics and significance of gradients.

•  Chapter  8 

The experimental results in this chapter show that the estimated values for the
components of the fixation velocity and ω_R0 depend heavily on the size of the image
patch used in the computation. It will be shown that depending on the image, and
the  fixation  point  location,  there  are  some  patch  sizes  which  result  in  good  estimates 
for  the  desired  motion  parameters. 

•  Chapter  9 

This  chapter  presents  a  novel  and  reliable  technique  for  autonomous  choice  of  an 
optimum fixation patch size that results in good estimates for the motion parameters
from  real  noisy  images. 

•  Chapter  10 

The  fixation  method  does  not  place  any  restrictions  on  the  choice  of  the  fixation 
point  and  virtually  any  point  can  be  chosen  as  the  fixation  point.  However,  some 
considerations  should  be  taken  into  account  when  choosing  a  fixation  point.  For 
example,  choosing  a  point  at  the  center  of  a  patch  which  has  uniform  brightness  is  not 
good because the motion is not detectable. This chapter introduces an autonomous
technique for choosing an appropriate fixation point.

•  Chapter  11 

Not only in our fixation technique but also in many other methods there is a
substantial need for a sequence of fixated (tracked) images. This chapter introduces a novel
method (pixel shifting process) for constructing a sequence of fixated images from an
arbitrary image sequence using the components of the fixation velocity.

•  Chapter  12 

Using the estimated motion parameters and the constructed sequence of fixated
images, this chapter describes the issues involved in recovering depth maps. Detailed
techniques  are  presented  for  overcoming  practical  problems  such  as  noise  and  inherent 
image  deficiencies. 

•  Chapter  13 

Camera calibration is usually an unavoidable requirement for working with real
images. This chapter discusses some of the calibration issues that we faced in this
work.

•  Chapter  14 

We  conclude  this  work  by  giving  a  summary  of  the  fixation  method,  results,  features, 
assumptions,  shortcomings,  relation  to  other  works,  and  finally  some  thoughts  on  the 
possible  future  extensions. 

1.5.3  Part  III:  Supplements 

Some  of  the  relevant  theoretical  proofs  and  formulations  are  summarized  in  this  part. 

•  Appendix  A 

Provides  a  detailed  derivation  of  the  BCCE. 

•  Appendix  B 

Presents  the  formulations  for  computing  the  spatio-temporal  gradients. 

•  Appendix  C 

Describes a technique for computing the depth at the fixation point, Z_0.

Images are usually obtained from a regular electronic camera where the projection
is perspective. In this chapter, we first describe the camera model and the
coordinate system used in this work. Then, a mathematical background of the BCCE is
presented.

2.1  Modeling  and  Coordinate  System 

As  shown  in  fig.  2-1,  the  coordinate  system  is  attached  to  the  camera  so  that  its  origin 
is  located  at  the  projection  center. 

The image plane is where the image of the environment is projected. In an electronic
camera, a CCD (Charge Coupled Device) plays the role of the image plane. The CCD
is an electronic light-sensitive plane. It consists of a tessellation of small rectangular
or square photo-sensitive cells which are called pixels. Each pixel of the CCD is
electronically charged depending on the number of photons it receives. Thus, the
charge level of each pixel is a representation of the brightness at the corresponding
point in the image plane. By reading the charge levels of all pixels and converting them
appropriately, the image can be written to a file or displayed on a screen.

The image plane in our coordinate system is parallel to the X-Y plane and is
located at a distance equal to the focal length from it. The optical axis Z pierces the
image plane at a point which is called the principal point. Any environment point R
is projected to an image point r in this coordinate system.


Figure 2-1: The coordinate system is attached to the camera and the projection is
perspective.


2.2  Basic  Definitions 


Using a viewer-centered coordinate system adopted from Longuet-Higgins and
Prazdny [36] is very common in direct motion vision. Figure 2-2 depicts the coordinate
system under consideration.

In such a coordinate system, a world point

$$\mathbf{R} = (X \;\; Y \;\; Z)^T \tag{2.1}$$

is imaged at

$$\mathbf{r} = (x \;\; y \;\; 1)^T. \tag{2.2}$$


That is, the image plane has the equation Z = 1, or in other words the focal length
f is 1. The origin is at the projection center and the Z-axis runs along the optical
axis. The X and Y axes are parallel to the x and y axes of the image plane. Image
coordinates are measured relative to the principal point, the point (0 0 1)^T where the
optical axis pierces the image plane. The position vectors r and R are related by the
perspective projection equation

$$\mathbf{r} = (x \;\; y \;\; 1)^T = \frac{\mathbf{R}}{\mathbf{R} \cdot \hat{\mathbf{z}}} \tag{2.3}$$

where ẑ denotes the unit vector along the Z-axis and R · ẑ = Z.

Figure 2-2: When the viewer has translational velocity t = (U V W)^T
and rotational velocity ω = (A B C)^T, any environment point R has the velocity
R_t from the observer's point of view.
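As a small illustration of the perspective projection with unit focal length, the sketch below maps a world point R = (X Y Z)^T to its image point r = R / (R · ẑ). The function name `project` and the check that points lie in front of the camera are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def project(R):
    """Perspective projection r = R / (R . z_hat) onto the image plane Z = 1.

    R may be a single 3-vector or an array of shape (..., 3).
    """
    R = np.asarray(R, dtype=float)
    Z = R[..., 2]                      # R . z_hat is simply the depth Z
    if np.any(Z <= 0):
        raise ValueError("points must lie in front of the camera (Z > 0)")
    return R / Z[..., None]

# The world point (2, 4, 2) is imaged at (x, y, 1) = (1, 2, 1).
r = project([2.0, 4.0, 2.0])
```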

When the observer moves with instantaneous translational velocity t = (U V W)^T
and instantaneous rotational velocity ω = (A B C)^T relative to an environment, then
the time derivative of the position vector of a point in the environment, R, relative
to the observer can be written as

$$\mathbf{R}_t = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}. \tag{2.4}$$


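The motion of the world point R results in the motion of its corresponding image
point r. It can be shown that the motion field in the image plane is obtained by
differentiating eqn. 2.3 with respect to time as in [44]

$$\mathbf{r}_t = \frac{d}{dt}\left(\frac{\mathbf{R}}{\mathbf{R} \cdot \hat{\mathbf{z}}}\right) = \frac{\hat{\mathbf{z}} \times (\mathbf{R}_t \times \mathbf{r})}{\mathbf{R} \cdot \hat{\mathbf{z}}}. \tag{2.5}$$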
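The step from eqn. 2.3 to eqn. 2.5 can be filled in as follows. This is a sketch that uses only the quotient rule, the identity a × (b × c) = b (a · c) − c (a · b), and the fact that ẑ · r = 1 because the image plane is Z = 1.

```latex
\begin{aligned}
\frac{d}{dt}\left(\frac{\mathbf{R}}{\mathbf{R}\cdot\hat{\mathbf{z}}}\right)
 &= \frac{\mathbf{R}_t\,(\mathbf{R}\cdot\hat{\mathbf{z}})
        - \mathbf{R}\,(\mathbf{R}_t\cdot\hat{\mathbf{z}})}
         {(\mathbf{R}\cdot\hat{\mathbf{z}})^2}
  = \frac{\mathbf{R}_t - \mathbf{r}\,(\mathbf{R}_t\cdot\hat{\mathbf{z}})}
         {\mathbf{R}\cdot\hat{\mathbf{z}}}, \\
\hat{\mathbf{z}}\times(\mathbf{R}_t\times\mathbf{r})
 &= \mathbf{R}_t\,(\hat{\mathbf{z}}\cdot\mathbf{r})
  - \mathbf{r}\,(\hat{\mathbf{z}}\cdot\mathbf{R}_t)
  = \mathbf{R}_t - \mathbf{r}\,(\mathbf{R}_t\cdot\hat{\mathbf{z}}).
\end{aligned}
```

The two right-hand sides agree, which gives the stated form of the motion field.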


Substituting for R, r and R_t from equations 2.1, 2.2, and 2.4 into eqn. 2.5 gives
[36, 14]

$$\mathbf{r}_t = \begin{pmatrix} x_t \\ y_t \\ 0 \end{pmatrix} = \begin{pmatrix} \frac{-U + xW}{Z} + Axy - B(x^2 + 1) + Cy \\ \frac{-V + yW}{Z} - Bxy + A(y^2 + 1) - Cx \\ 0 \end{pmatrix}. \tag{2.6}$$

This result is just the parallax equations of photogrammetry that occur in the
incremental adjustment of relative orientation [23, 42]. It shows how, given the
environment motion, the motion field can be calculated for every image point.
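The motion field formula can be checked numerically against the definition it comes from. The sketch below evaluates the closed-form components x_t and y_t and compares them with a finite-difference estimate obtained by moving a world point according to R_t = -t - ω × R and re-projecting it. The function names and the step size `dt` are illustrative assumptions.

```python
import numpy as np

def motion_field(x, y, Z, t, omega):
    """Closed-form image velocity (x_t, y_t) at image point (x, y) with depth Z."""
    U, V, W = t
    A, B, C = omega
    x_t = (-U + x * W) / Z + A * x * y - B * (x**2 + 1) + C * y
    y_t = (-V + y * W) / Z - B * x * y + A * (y**2 + 1) - C * x
    return x_t, y_t

def motion_field_numeric(R, t, omega, dt=1e-6):
    """Finite-difference estimate: move R for a short time dt and re-project."""
    R = np.asarray(R, dtype=float)
    R_t = -np.asarray(t, dtype=float) - np.cross(omega, R)  # velocity of R
    R_next = R + dt * R_t
    r, r_next = R / R[2], R_next / R_next[2]                # unit focal length
    return (r_next - r)[:2] / dt

R = np.array([0.5, -0.3, 2.0])
t = np.array([0.1, -0.2, 0.3])
omega = np.array([0.01, 0.02, -0.03])
analytic = motion_field(R[0] / R[2], R[1] / R[2], R[2], t, omega)
numeric = motion_field_numeric(R, t, omega)
```

For small `dt` the two estimates should agree to within the truncation error of the forward difference.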


2.3  The  Brightness  Change  Constraint  Equation 

Image brightness changes are primarily due to the relative motion between an
environment and an observer, provided that the surfaces of the objects have sufficient
texture and the lighting condition varies slowly enough both spatially and with time.
In such cases (which may occur in practical applications), brightness changes due to
the variations in the surface orientation and illumination can be neglected.
Consequently, we may assume that the brightness of a small patch on a surface in the scene
does not change during motion. As shown in appendix A, when the motion is small
the expansion of the total derivative of brightness E leads to

$$\frac{dE}{dt} = E_t + x_t E_x + y_t E_y = 0, \tag{2.7}$$

known as the Brightness Change Constraint Equation (BCCE), where (E_x, E_y) and
E_t are the spatial and temporal gradients of the image brightness at any given pixel
[30, 54, 29].

Note  that  eqn.  2.7  does  not  hold  for  the  special  case  that  the  viewer  and  the 
light  source  are  stationary  and  the  environment  moves  relative  to  them  because  the 
brightness  of  a  surface  patch  does  not  remain  constant  in  this  case. 


2.3.1  Rigid  body  motion 

In rigid body motion, there is only one relative motion between the observer and the
environment. For this case, we can substitute for x_t and y_t from eqn. 2.6 into eqn. 2.7
to obtain the brightness-change constraint equation for the rigid body motion [44] as

$$E_t + \mathbf{v} \cdot \boldsymbol{\omega} + \frac{\mathbf{s} \cdot \mathbf{t}}{Z} = 0. \tag{2.8}$$

This equation is nonlinear in terms of the unknowns: rotation ω, translation t, and depth
Z. The auxiliary vectors s and v are known at any pixel (x, y) and are defined as


$$\mathbf{s} = \begin{pmatrix} -E_x \\ -E_y \\ xE_x + yE_y \end{pmatrix} \tag{2.9}$$

and

$$\mathbf{v} = \begin{pmatrix} E_y + y(xE_x + yE_y) \\ -E_x - x(xE_x + yE_y) \\ yE_x - xE_y \end{pmatrix}. \tag{2.10}$$

Since s · r = 0, v · r = 0 and s · v = 0, the vectors r, s, and v form an orthogonal triad;
see fig. 2-3. The vectors s and v represent inherent properties of the image. Also, it
can be shown that v = r × s. The vector s indicates the direction in which translation
of a given magnitude will contribute maximally to the temporal brightness change of
a given picture cell. The vector v plays a similar role for rotation.

¹ To account for smooth variations in the image brightness due to other factors such as shading,
spatial and temporal illumination changes, and variations in reflectance properties, the BCCE can
be extended to

$$E_t + x_t E_x + y_t E_y = m_t E + c_t,$$

where in general m_t and c_t are time and position dependent [21, 45]. Cornelius & Kanade [17] also
propose a method which allows gradual changes in brightness. These extensions are not discussed here.


Figure 2-3: At any pixel, vectors r (pixel position), s, and v form an orthogonal triad.
Also v = r × s.
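The orthogonality relations and the rigid-body BCCE can be checked numerically. The sketch below assumes the components s = (-E_x, -E_y, xE_x + yE_y)^T and v = (E_y + y(xE_x + yE_y), -E_x - x(xE_x + yE_y), yE_x - xE_y)^T, which are what substituting the motion field into the BCCE yields; the function names are illustrative.

```python
import numpy as np

def auxiliary_vectors(x, y, E_x, E_y):
    """Return (r, s, v) at image point (x, y) with spatial gradients E_x, E_y."""
    g = x * E_x + y * E_y                      # shared term x*E_x + y*E_y
    r = np.array([x, y, 1.0])                  # pixel position on plane Z = 1
    s = np.array([-E_x, -E_y, g])
    v = np.array([E_y + y * g, -E_x - x * g, y * E_x - x * E_y])
    return r, s, v

def bcce_residual(E_t, s, v, t, omega, Z):
    """Left-hand side of the rigid-body BCCE: E_t + v.omega + (s.t)/Z."""
    return E_t + np.dot(v, omega) + np.dot(s, t) / Z

r, s, v = auxiliary_vectors(0.2, -0.1, 3.0, -1.5)
triad_dots = (np.dot(s, r), np.dot(v, r), np.dot(s, v))   # all three vanish
cross_check = np.cross(r, s) - v                           # zero vector
```

With noise-free gradients generated from a known motion, `bcce_residual` evaluates to zero up to rounding; with real images it becomes the per-pixel error that a least squares method would minimize.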

The BCCE, eqn. 2.7, does not change if we scale both Z and t by the same factor.
Consequently,  we  can  determine  only  the  direction  of  translational  velocity  and  the 
relative  depth  of  points  in  the  scene.  This  ambiguity  is  known  as  the  scale-factor 
ambiguity  in  motion  vision. 
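The ambiguity is easiest to see in the rigid-body form of the BCCE, where Z and t enter only through the ratio (s · t)/Z. As a short worked step, for any scale factor k ≠ 0:

```latex
E_t + \mathbf{v}\cdot\boldsymbol{\omega} + \frac{\mathbf{s}\cdot(k\,\mathbf{t})}{k\,Z}
  = E_t + \mathbf{v}\cdot\boldsymbol{\omega} + \frac{\mathbf{s}\cdot\mathbf{t}}{Z}.
```

The rotational term is unaffected, so the rotational velocity is not subject to this ambiguity.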

Equation  2.7  is  obtained  under  the  following  assumptions: 

• No noise,

• Sufficient surface texture,

• Slow spatio-temporal variations in lighting,

• Small motions between frames.

In real images, violation of any of these conditions may cause eqn. 2.7 not to
hold at a given pixel. However, later we will show how this equation can be used in
a  least  squares  method  for  recovery  of  shape  and  motion  from  real  image  sequences. 


Fixation  Formulation 

Chapter  3 


Our  common  visual  experience  suggests  that  fixation  may  play  an  important  role 
in  the  analysis  of  moving  objects.  When  we  want  to  understand  the  motion  of  an 
object,  we  do  not  keep  our  eyes  and  head  stationary  in  front  of  the  moving  object. 
Instead,  our  head  and/or  eyes  follow  the  moving  object,  in  order  to  keep  the  image 
of  a  point  of  interest  stationary  in  the  retina.  There  are  also  some  formal  studies 
that  support  such  observations  [6,  7,  9].  In  this  computer  vision  work,  the  fixation  is 
defined  as: 

Given two subsequent images, 1st and 2nd initial images, and an arbitrary
point in the 1st initial image, find a new image, a 2nd fixated image, such
that the image of the selected point in the new image is located at its
original position as in the 1st initial image.

This  definition  of  fixation  is  shown  schematically  in  fig.  3-1.  If  we  choose  point  1 
in  the  1st  initial  image  as  the  fixation  point,  its  image  in  the  2nd  initial  image  may 
move to a new location such as 2. In chapter 5, we introduce a simple technique for
converting  the  2nd  initial  image  in  order  to  bring  image  point  2  to  the  same  physical 
location  as  point  1.  This  process  will  construct  the  2nd  fixated  image  and  form  a 
sequence  of  images  fixated  at  point  1. 
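As a rough illustration of what bringing point 2 back to the location of point 1 can look like in software, the sketch below resamples the 2nd image at positions displaced by (dx, dy), using bilinear interpolation for sub-pixel displacements. It is only a sketch under stated assumptions: that the displacement of the fixation point between the two images is already known in pixels, and that a uniform shift of the whole image is an acceptable stand-in for the technique of chapter 5. The function name `shift_image` and the edge clamping are illustrative choices.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Return out with out[i, j] ~= image[i + dy, j + dx].

    dx is the displacement along columns and dy along rows, both in pixels and
    possibly fractional. Samples falling outside the image are clamped to the
    nearest edge pixel (a hypothetical boundary policy).
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    src_i = np.clip(ii + dy, 0, rows - 1)
    src_j = np.clip(jj + dx, 0, cols - 1)
    i0 = np.floor(src_i).astype(int)
    j0 = np.floor(src_j).astype(int)
    i1 = np.minimum(i0 + 1, rows - 1)
    j1 = np.minimum(j0 + 1, cols - 1)
    fi, fj = src_i - i0, src_j - j0            # fractional parts
    top = (1 - fj) * image[i0, j0] + fj * image[i0, j1]
    bottom = (1 - fj) * image[i1, j0] + fj * image[i1, j1]
    return (1 - fi) * top + fi * bottom

second = np.arange(16.0).reshape(4, 4)
fixated = shift_image(second, 1, 0)    # a point seen at column j+1 returns to j
half = shift_image(second, 0.5, 0)     # sub-pixel shift averages neighbours
```

If the fixation point sits at (row, col) in the 1st image and at (row + dy, col + dx) in the 2nd, then `shift_image(second, dx, dy)` places it back at (row, col).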

As shown in fig. 3-2, we refer to this arbitrarily selected image point as the fixation
point, r_0, and to its corresponding point on the object as the interest point, R_0.

Figure 3-1: A schematic interpretation of fixation point and fixated image sequence
(panels: 2nd initial image, 2nd fixated image).


3.1 Derivation of the General Fixation Constraint Equation

Figure 3-2: ... is kept stationary in the image plane despite the relative motion between the camera
and the environment.

For a sequence of two fixated images, at the fixation point r_0 we should have

$$\mathbf{r}_{0t} = \mathbf{0} \tag{3.1}$$

where r_0t is the time derivative of the fixation point vector and, similar to eqn. 2.5, it
can be written as

$$\mathbf{r}_{0t} = \frac{\hat{\mathbf{z}} \times (\mathbf{R}_{0t} \times \mathbf{r}_0)}{\mathbf{R}_0 \cdot \hat{\mathbf{z}}}. \tag{3.2}$$


R_0t is the time derivative of the interest point vector. Combination of equations 3.1
and 3.2 shows that for fixation we need to have

$$\hat{\mathbf{z}} \times (\mathbf{R}_{0t} \times \mathbf{r}_0) = \mathbf{0}. \tag{3.3}$$


In other words, we want to find out when R_0t × r_0 is zero or parallel to ẑ. For R_0t × r_0
to be parallel to ẑ, we should have r_0 perpendicular to ẑ, which is not possible with a
finite field of view, so only R_0t × r_0 = 0 applies. Consequently, considering that R_0
and r_0 have the same direction, eqn. 3.3 is simplified as

$$\mathbf{R}_{0t} \times \mathbf{R}_0 = \mathbf{0}. \tag{3.4}$$


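Now substituting $\mathbf{R}_{ot} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}_o$ (eqn. 2.4) into eqn. 3.4 gives

$$(\boldsymbol{\omega} \times \mathbf{R}_o) \times \mathbf{R}_o + \mathbf{t} \times \mathbf{R}_o = \mathbf{0}. \tag{3.5}$$

Expansion of eqn. 3.5 by using $(\mathbf{a} \times \mathbf{b}) \times \mathbf{c} = (\mathbf{c} \cdot \mathbf{a})\mathbf{b} - (\mathbf{c} \cdot \mathbf{b})\mathbf{a}$ results in

$$(\mathbf{R}_o \cdot \boldsymbol{\omega})\mathbf{R}_o - (\mathbf{R}_o \cdot \mathbf{R}_o)\boldsymbol{\omega} + \mathbf{t} \times \mathbf{R}_o = \mathbf{0}. \tag{3.6}$$

As long as the translational velocity $\mathbf{t}$ is neither zero nor parallel to the interest
point vector $\mathbf{R}_o$, any vector, including $\boldsymbol{\omega}$, can be expressed in terms of the triad
of vectors $\mathbf{R}_o$, $\mathbf{t} \times \mathbf{R}_o$ and $\mathbf{t}$. So we can write $\boldsymbol{\omega}$ in its general form as

$$\boldsymbol{\omega} = \alpha\mathbf{R}_o + \beta(\mathbf{t} \times \mathbf{R}_o) + \gamma\mathbf{t} \tag{3.7}$$

where $\alpha$, $\beta$ and $\gamma$ are parameters to be determined. Later in this section we will
consider the special cases where $\mathbf{t}$ is zero or parallel to $\mathbf{R}_o$ by defining $\boldsymbol{\omega}$ based on
another triad of vectors.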

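Substituting for $\boldsymbol{\omega}$ from eqn. 3.7 into eqn. 3.6 gives

$$[1 - \beta(\mathbf{R}_o \cdot \mathbf{R}_o)](\mathbf{t} \times \mathbf{R}_o) + \gamma(\mathbf{R}_o \cdot \mathbf{t})\mathbf{R}_o - \gamma(\mathbf{R}_o \cdot \mathbf{R}_o)\mathbf{t} = \mathbf{0}. \tag{3.8}$$

Now, we should find the parameters $\beta$ and $\gamma$ such that eqn. 3.8 holds without placing
any restrictions on either $\mathbf{R}_o$ or $\mathbf{t}$. We start by finding the dot product of eqn. 3.8
with $\mathbf{t} \times \mathbf{R}_o$ which results in

$$[1 - \beta(\mathbf{R}_o \cdot \mathbf{R}_o)]\,\|\mathbf{t} \times \mathbf{R}_o\|^2 = 0. \tag{3.9}$$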




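Equation 3.9 will hold without restricting either $\mathbf{R}_o$ or $\mathbf{t}$ if

$$\beta = \frac{1}{\|\mathbf{R}_o\|^2}. \tag{3.10}$$

Another possibility for satisfying eqn. 3.9 is to have $\|\mathbf{t} \times \mathbf{R}_o\| = 0$, which implies that
either $\mathbf{t}$ or $\mathbf{R}_o$ is zero, or $\mathbf{t}$ is parallel to $\mathbf{R}_o$. But $\mathbf{R}_o$ cannot be zero and also we
assumed that here $\mathbf{t}$ is neither zero nor parallel to $\mathbf{R}_o$. As a result, $\|\mathbf{t} \times \mathbf{R}_o\|$ cannot
be zero.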

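Similarly, the dot product of eqn. 3.8 with $\mathbf{t}$ gives

$$\gamma(\mathbf{R}_o \cdot \mathbf{t})(\mathbf{R}_o \cdot \mathbf{t}) - \gamma(\mathbf{R}_o \cdot \mathbf{R}_o)(\mathbf{t} \cdot \mathbf{t}) = 0. \tag{3.11}$$

Knowing that $(\mathbf{a} \times \mathbf{b}) \cdot (\mathbf{c} \times \mathbf{d}) = (\mathbf{c} \cdot \mathbf{a})(\mathbf{b} \cdot \mathbf{d}) - (\mathbf{b} \cdot \mathbf{c})(\mathbf{d} \cdot \mathbf{a})$, eqn. 3.11 can be simplified
as

$$\gamma\,\|\mathbf{t} \times \mathbf{R}_o\|^2 = 0. \tag{3.12}$$

We discussed that $\|\mathbf{t} \times \mathbf{R}_o\|$ cannot be zero here, so eqn. 3.12 is satisfied only if $\gamma$ is
zero

$$\gamma = 0. \tag{3.13}$$

Substituting for $\beta$ from eqn. 3.10 and $\gamma$ from eqn. 3.13 into eqn. 3.7 gives

$$\boldsymbol{\omega} = \alpha\mathbf{R}_o + \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o) \tag{3.14}$$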

where $\alpha$ is still unknown. This means that the component of the rotational velocity
along $\mathbf{R}_o$ cannot be determined by the fixation formulation. Physically this makes
sense because the rotational velocity along $\mathbf{R}_o$, denoted by $\omega_{R_o}$, does not move the
fixation point. This observation leads us to find $\omega_{R_o}$ in a separate step before using
the fixation formulation results. Derivation of $\omega_{R_o}$ will be shown in chapter 4.
As a result, the fixation constraint equation (FCE) is written as

$$\boldsymbol{\omega} = \omega_{R_o}\hat{\mathbf{R}}_o + \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o) \tag{3.15}$$

where $\mathbf{t}$ is the translational velocity and $\hat{\mathbf{R}}_o = \hat{\mathbf{r}}_o$ is the unit vector along the position
vector of an arbitrary fixation point, an arbitrary point in the image chosen for
fixation. Equation 3.15 shows that after fixation, the rotational velocity $\boldsymbol{\omega}$ can be
explicitly expressed as a linear function of the translational velocity $\mathbf{t}$.
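The FCE can be checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates eqn. 3.15 for sample values and confirms that the resulting interest point velocity $\mathbf{R}_{ot} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R}_o$ satisfies eqn. 3.4. The function names and the numeric values of `t`, `R_o` and `omega_Ro` are illustrative assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def fce_rotation(t, R_o, omega_Ro):
    """Rotational velocity from eqn. 3.15 (hypothetical helper name)."""
    norm_sq = dot(R_o, R_o)
    norm = norm_sq ** 0.5
    t_x_R = cross(t, R_o)
    return tuple(omega_Ro * R_o[i] / norm + t_x_R[i] / norm_sq
                 for i in range(3))


def interest_point_velocity(t, omega, R_o):
    """R_ot = -t - omega x R_o, as in eqn. 2.4."""
    w_x_R = cross(omega, R_o)
    return tuple(-t[i] - w_x_R[i] for i in range(3))


# Arbitrary sample values (assumptions, chosen only for illustration).
t = (0.3, -0.2, 0.5)
R_o = (1.0, 2.0, 4.0)
omega = fce_rotation(t, R_o, omega_Ro=0.7)
R_ot = interest_point_velocity(t, omega, R_o)

# Eqn. 3.4 requires R_ot x R_o = 0 for a fixated sequence.
residual = cross(R_ot, R_o)
```

The residual vanishes for any choice of `omega_Ro`, which illustrates why the component of rotation along the fixation axis cannot be recovered from the fixation condition alone.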


3.1.1  Derivation  of  special  fixation  constraint  equation 

When the translational velocity $\mathbf{t}$ is zero or parallel to the interest point vector $\mathbf{R}_o$,
eqn. 3.6 is simplified as

$$(\mathbf{R}_o \cdot \boldsymbol{\omega})\mathbf{R}_o - (\mathbf{R}_o \cdot \mathbf{R}_o)\boldsymbol{\omega} = \mathbf{0}. \tag{3.16}$$

This time, $\boldsymbol{\omega}$ is defined based on the triad consisting of vectors $\mathbf{R}_o$, $\hat{\mathbf{x}}$, and $\hat{\mathbf{x}} \times \mathbf{R}_o$ as

$$\boldsymbol{\omega} = l\,\mathbf{R}_o + m(\hat{\mathbf{x}} \times \mathbf{R}_o) + n\,\hat{\mathbf{x}} \tag{3.17}$$

where $l$, $m$, and $n$ are parameters to be determined. Here we assume that $\mathbf{R}_o$ is not
parallel to $\hat{\mathbf{x}}$. This is a reasonable assumption because otherwise we should at least
have a field of view of 180° to be able to choose an awkward interest point along the
x-axis, which results in a fixation point at an infinite distance from the principal
point and near the border of an infinite image plane.

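Substituting for $\boldsymbol{\omega}$ from eqn. 3.17 into eqn. 3.16 gives

$$n(\mathbf{R}_o \cdot \hat{\mathbf{x}})\mathbf{R}_o - m(\mathbf{R}_o \cdot \mathbf{R}_o)(\hat{\mathbf{x}} \times \mathbf{R}_o) - n(\mathbf{R}_o \cdot \mathbf{R}_o)\hat{\mathbf{x}} = \mathbf{0}. \tag{3.18}$$

The dot product of eqn. 3.18 with $(\hat{\mathbf{x}} \times \mathbf{R}_o)$ results in

$$-m(\mathbf{R}_o \cdot \mathbf{R}_o)\,\|\hat{\mathbf{x}} \times \mathbf{R}_o\|^2 = 0. \tag{3.19}$$

Considering that $\mathbf{R}_o$ cannot be either zero or parallel to $\hat{\mathbf{x}}$, eqn. 3.19 is satisfied only
if $m$ is zero

$$m = 0. \tag{3.20}$$

Substituting for $m$ into eqn. 3.18 and finding its dot product with $\hat{\mathbf{x}}$ results in

$$n(\mathbf{R}_o \cdot \hat{\mathbf{x}})(\mathbf{R}_o \cdot \hat{\mathbf{x}}) - n(\mathbf{R}_o \cdot \mathbf{R}_o)(\hat{\mathbf{x}} \cdot \hat{\mathbf{x}}) = 0. \tag{3.21}$$

Using $(\mathbf{a} \times \mathbf{b}) \cdot (\mathbf{c} \times \mathbf{d}) = (\mathbf{c} \cdot \mathbf{a})(\mathbf{b} \cdot \mathbf{d}) - (\mathbf{b} \cdot \mathbf{c})(\mathbf{d} \cdot \mathbf{a})$, eqn. 3.21 can be written as

$$n\,\|\hat{\mathbf{x}} \times \mathbf{R}_o\|^2 = 0. \tag{3.22}$$

Again $\mathbf{R}_o$ cannot be either zero or parallel to $\hat{\mathbf{x}}$. As a result, eqn. 3.22 will hold for
arbitrary $\mathbf{R}_o$ if $n = 0$. Substituting for $n$ and $m$ into eqn. 3.17 gives

$$\boldsymbol{\omega} = l\,\mathbf{R}_o \tag{3.23}$$

where $l$ is still unknown. We can substitute $\omega_{R_o}\hat{\mathbf{R}}_o$ for $l\,\mathbf{R}_o$. The procedure for
computing the component of rotational velocity along the fixation axis, $\omega_{R_o}$, will be
given in chapter 4. Consequently, for the special cases we obtain the special fixation
constraint equation (SFCE) as

$$\boldsymbol{\omega} = \omega_{R_o}\hat{\mathbf{R}}_o \tag{3.24}$$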

which  means  that  when  the  translational  velocity  t  is  zero  or  parallel  to  Ro  then  the 
corresponding  rotational  velocity  may  only  have  a  component  along  Ro. 

This procedure for deriving the SFCE, eqn. 3.24, is not essentially different from
what we did for deriving the FCE, eqn. 3.15. In fact, eqn. 3.24 is a special case of
eqn. 3.15. But we did not directly derive eqn. 3.24 from eqn. 3.15 because eqn. 3.15
was derived based on the assumption that $\mathbf{t}$ is neither zero nor parallel to $\mathbf{R}_o$. As a
result, for implementation it is enough to use the FCE, eqn. 3.15, without knowing
whether the present condition is a special case or not.

3.1.2  Interpretation  of  the  FCE 

We gave a detailed mathematical derivation of the fixation constraint equation
(FCE), eqn. 3.15. This constraint equation indicates that for a sequence of
fixated images, the rotational velocity $\boldsymbol{\omega}$ can always be expressed as a linear function
of the translational velocity $\mathbf{t}$. This section examines whether the FCE makes sense
physically.

The first term $\omega_{R_o}\hat{\mathbf{R}}_o$ says that $\boldsymbol{\omega}$ can have an unrestricted component along the
fixation axis $\mathbf{R}_o$. This is correct because such a component does not cause the fixation
point to move and as a result the fixation is not violated.

The second term of the FCE, $\frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o)$, conveys two points:

•  The translation $\mathbf{t}$ can have an arbitrary component along the fixation axis $\mathbf{R}_o$
because such a component does not move the fixation point in the image plane.

•  The rotational velocity $\boldsymbol{\omega}$ should have a component perpendicular to $\mathbf{R}_o$ and be
large enough to compensate for the component of the translational velocity $\mathbf{t}$ which
is perpendicular to $\mathbf{R}_o$ in order to keep the fixation point stationary in the image plane.

We  can  conclude  that  the  FCE  has  a  meaningful  physical  interpretation. 


3.2  Solving the General Direct Motion Vision Problem

At this stage, we assume that a sequence of two fixated images has been constructed.
In other words, we have made the fixation point stationary in the image plane. This
can be done first by finding the fixation velocity, the apparent velocity at the fixation
point in the 1st image, as shown in chapter 4. Then the pixel shifting process explained
in chapter 5 can be used for constructing a new image, the 2nd fixated image, in which
the image of the interest point is positioned at the same point as in the 1st initial
image.

We  start  by  studying  the  general  case  where  the  translational  velocity  t  is  neither 
zero  nor  parallel  to  the  interest  point  vector  Ro.  The  special  cases  of  t  will  be 
discussed  later. 

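Substituting for $\boldsymbol{\omega}$ from the fixation constraint equation 3.15 into the brightness-
change constraint equation 2.8 gives

$$E_t + \omega_{R_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o + \frac{1}{\|\mathbf{R}_o\|^2}\,\mathbf{v} \cdot (\mathbf{t} \times \mathbf{R}_o) + \frac{1}{Z}(\mathbf{s} \cdot \mathbf{t}) = 0. \tag{3.25}$$

Knowing that $\mathbf{a} \cdot (\mathbf{b} \times \mathbf{c}) = (\mathbf{a} \times \mathbf{b}) \cdot \mathbf{c}$ and doing some manipulations on eqn. 3.25
results in

$$E'_t + \left[\frac{1}{Z}\,\mathbf{s} - \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{v} \times \mathbf{R}_o)\right] \cdot \mathbf{t} = 0 \tag{3.26}$$

where $E'_t$ is a notation for $E_t + \omega_{R_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o$ which is computable at any pixel assuming
that $\omega_{R_o}$ is known. In chapter 4, we will introduce a technique which finds a good
estimate for $\omega_{R_o}$.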

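In general, eqn. 3.26 can be solved numerically for $\mathbf{t}$ and $Z$ using images of any
size and with any field of view. For a small patch around the fixation point, called a
fixation patch, eqn. 3.26 can be simplified as¹

$$E'_t + \left(\frac{1}{Z} - \frac{1}{Z_o}\right)(\mathbf{s} \cdot \mathbf{t}) = 0. \tag{3.27}$$

¹ Considering that $\|\mathbf{R}_o\| = Z_o\|\mathbf{r}_o\|$ and $\mathbf{v} = \mathbf{r} \times \mathbf{s}$, the term $\frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{v} \times \mathbf{R}_o)$ from eqn. 3.26, let's
call it $\mathbf{K}$, can be expanded as

$$\mathbf{K} = \frac{1}{Z_o\|\mathbf{r}_o\|^2}\,(\mathbf{r} \times \mathbf{s}) \times \mathbf{r}_o.$$

Further expansion of $\mathbf{K}$ by using the relation $(\mathbf{a} \times \mathbf{b}) \times \mathbf{c} = (\mathbf{c} \cdot \mathbf{a})\mathbf{b} - (\mathbf{c} \cdot \mathbf{b})\mathbf{a}$ results in

$$\mathbf{K} = \frac{1}{Z_o\|\mathbf{r}_o\|^2}\left[(\mathbf{r}_o \cdot \mathbf{r})\,\mathbf{s} - (\mathbf{r}_o \cdot \mathbf{s})\,\mathbf{r}\right].$$

As described in the footnote, the approximation made here is based on a purely
geometric assumption and is not related to the image properties. For example, we
are not making any assumptions about the depth topology. We simply assume that
motion parameters can be obtained using a small fixation patch. As shown in fig. 3-3,
the smallness of such a patch translates into the smallness of an angle $\alpha$. Numerous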


Figure  3-3:  A  schematic  interpretation  of  fixation  point  and  fixated  image  sequence. 

experimental  results  in  chapter  9  show  that  indeed  good  motion  estimates  are  obtained 
using optimum patch sizes with a field of view small enough to justify this assumption.

In analogy to the pure translation case of [31], we can find the translational velocity
$\mathbf{t}$. Equation 3.27 shows that $1/(1/Z - 1/Z_o) = -(\mathbf{s} \cdot \mathbf{t})/E'_t$. At the points where $E'_t$ is very small,
even a small error in computing $\mathbf{t}$ will result in a large error in $1/(1/Z - 1/Z_o)$, which
translates into a large error in the estimation of depth $Z$. Considering this fact, the
true translational velocity $\mathbf{t}$ can be found from eqn. 3.27 by minimizing

$$J = \iint \left(\frac{\mathbf{s} \cdot \mathbf{t}}{E'_t}\right)^2 dx\,dy \tag{3.28}$$

with respect to $\mathbf{t}$. In other words, we are looking for the true motion $\mathbf{t}$ which minimizes
the sum of squares of $1/(1/Z - 1/Z_o)$ over the fixation patch. Note that this minimization does
not force $Z$ towards $Z_o$ because at $Z = Z_o$ the value of $J$ becomes infinite.

(Footnote 1, continued.) It is clear that at the fixation point, where $\mathbf{r} = \mathbf{r}_o$ and $\mathbf{s} = \mathbf{s}_o$,
$\mathbf{K} = \frac{1}{Z_o}\mathbf{s}_o$, and for the points near the fixation point $\mathbf{K} \approx \frac{1}{Z_o}\mathbf{s}$.

We also put the $\|\mathbf{t}\| = 1$ constraint on this minimization problem to avoid the
trivial solution $\mathbf{t} = \mathbf{0}$. This is a valid constraint on $\mathbf{t}$ because due to the scale factor
ambiguity we can only find the direction of $\mathbf{t}$. This constraint on $\mathbf{t}$ can be written as

$$\mathbf{t}^T\mathbf{t} = 1. \tag{3.29}$$


Moreover, we can rewrite $J$ as

$$J = \mathbf{t}^T M \mathbf{t} \tag{3.30}$$

where $M$ is a fully computable 3×3 symmetric matrix

$$M = \iint \frac{1}{E'^2_t}\,\mathbf{s}\,\mathbf{s}^T\,dx\,dy. \tag{3.31}$$

Minimizing $J$ in eqn. 3.30 under the constraint eqn. 3.29 is an ordinary calculus
constrained minimization problem which can be solved by minimizing

$$I(\mathbf{t}, \lambda) = \mathbf{t}^T M \mathbf{t} + \lambda(1 - \mathbf{t}^T\mathbf{t}) \tag{3.32}$$

with respect to $\mathbf{t}$ and the Lagrange multiplier $\lambda$. Then, we will obtain

$$\frac{\partial I}{\partial \mathbf{t}} = 2M\mathbf{t} - 2\lambda\mathbf{t} = \mathbf{0} \tag{3.33}$$

which is simplified as

$$M\mathbf{t} = \lambda\mathbf{t}. \tag{3.34}$$

Equation 3.34 is an eigenvalue problem where $\lambda$ is an eigenvalue of the known matrix
$M$ and $\mathbf{t}$ is the corresponding eigenvector. The eigenvalues of $M$ are real and nonnegative
because $M$ is a positive semidefinite Hermitian matrix. Substituting for $M\mathbf{t}$ from
eqn. 3.34 into eqn. 3.32 gives $I = \lambda$ which implies that under the given constraint,
$\mathbf{t}^T M \mathbf{t}$ is minimized when the smallest of three real and nonnegative eigenvalues is
used for computing the eigenvector $\mathbf{t}$.
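The eigenvector step can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the integral over the fixation patch is replaced by a sum over samples of $(\mathbf{s}, E'_t)$, and the smallest eigenvector is found by power iteration on the shifted matrix $\mathrm{tr}(M)\,I - M$, a swapped-in technique chosen only to keep the example dependency-free. The names `build_M` and `smallest_eigenvector` and the sample data are hypothetical.

```python
def build_M(samples):
    """Accumulate M = sum of s s^T / Et'^2 over patch samples (discrete
    stand-in for the integral defining M). Each sample is (s, Et_prime)."""
    M = [[0.0] * 3 for _ in range(3)]
    for s, et in samples:
        w = 1.0 / (et * et)
        for i in range(3):
            for j in range(3):
                M[i][j] += w * s[i] * s[j]
    return M


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]


def smallest_eigenvector(M, iters=500):
    """Unit eigenvector of a symmetric PSD 3x3 matrix for its smallest
    eigenvalue. Power iteration on B = trace(M) I - M, whose largest
    eigenvalue corresponds to the smallest eigenvalue of M.
    Assumes M is nonzero and the smallest eigenvalue is simple."""
    tr = M[0][0] + M[1][1] + M[2][2]
    B = [[(tr if i == j else 0.0) - M[i][j] for j in range(3)]
         for i in range(3)]
    v = normalize([1.0, 0.7, 0.4])  # arbitrary starting vector
    for _ in range(iters):
        v = normalize(mat_vec(B, v))
    Mv = mat_vec(M, v)
    lam = sum(v[i] * Mv[i] for i in range(3))  # Rayleigh quotient
    return lam, v


# Synthetic samples (assumed values) giving M = diag(1, 0.25, 4).
samples = [((1.0, 0.0, 0.0), 1.0),
           ((0.0, 1.0, 0.0), 2.0),
           ((0.0, 0.0, 1.0), 0.5)]
M = build_M(samples)
lam, t_hat = smallest_eigenvector(M)
```

The sign of the returned eigenvector is arbitrary, so only the direction of $\mathbf{t}$ up to sign is obtained from this step. In practice a library eigen-solver for symmetric matrices would be the more robust choice.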

It is shown that the fixation method can be used for solving the motion vision
problem in its general case. The translational velocity $\mathbf{t}$ is obtained from eqn. 3.34
by using the smallest eigenvalue and computing its corresponding eigenvector. Then
we can use eqn. 3.26 for finding the depth map, a depth at each image point, as

$$Z = \frac{\mathbf{s} \cdot \mathbf{t}}{\dfrac{(\mathbf{v} \times \mathbf{R}_o) \cdot \mathbf{t}}{\|\mathbf{R}_o\|^2} - E'_t}. \tag{3.35}$$

Then, eqn. 3.15 gives the partial rotational velocity $\boldsymbol{\omega}$

$$\boldsymbol{\omega} = \omega_{R_o}\hat{\mathbf{r}}_o + \frac{1}{\|\mathbf{R}_o\|^2}(\mathbf{t} \times \mathbf{R}_o) \tag{3.36}$$

where $\|\mathbf{R}_o\| = Z_o\|\mathbf{r}_o\|$ and $Z_o$ is the depth at the fixation point. Appendix C
introduces a technique for estimating $Z_o$.

The total rotational velocity of the observer relative to the environment is obtained
by adding $\boldsymbol{\omega}$ to the equivalent rotational velocity $\boldsymbol{\Omega}$ given in chapter 5. It can be seen
that for the general case, the fixation formulation lets us find the shape and motion
by choosing virtually any point as the fixation point.


3.2.1  Special  cases:  t  is  zero  or  parallel  to  Ro 

When the translational velocity $\mathbf{t}$ is zero, we showed that the partial rotational
velocity $\boldsymbol{\omega}$ has only a component about the fixation axis $\mathbf{R}_o$, eqn. 3.24. The technique
for computing this component of rotational velocity is given in chapter 4. For this
special case, pure rotation, there are also methods for finding the total rotational
velocity using the initial unfixated images [31]. In the case of $\mathbf{t} = \mathbf{0}$, we basically
cannot obtain any estimate for the depth $Z$.

For the other special case, in which $\mathbf{t}$ is parallel to $\mathbf{R}_o$, we substitute for $\boldsymbol{\omega}$ from eqn. 3.24
into the BCCE, eqn. 2.8, to obtain

$$E'_t + \frac{1}{Z}(\mathbf{s} \cdot \mathbf{t}) = 0 \tag{3.37}$$

where $E'_t$ is again a notation for the computable term $E_t + \omega_{R_o}\,\mathbf{v} \cdot \hat{\mathbf{R}}_o$. Because no
approximation is involved in deriving eqn. 3.37, an exact closed form solution exists
for $\mathbf{t}$ and $Z$ without any restriction on the field of view or the size of fixation patch.
This exact solution for finding $\mathbf{t}$ and $Z$ is the same as the solution given in the general
case, starting from eqn. 3.28, except that $J$ is defined as $\iint Z^2\,dx\,dy$ for this special
case.


Chapter 4

Computing the Fixation Velocity and Rotational Component


In an arbitrary image sequence, a point chosen as the fixation point does not
necessarily stay stationary in the image plane. We use the term fixation velocity to
refer to the apparent velocity at the fixation point in the initial 1st image. As shown
in fig. 4-1, the x and y components of the fixation velocity are represented by $u_o$ and
$v_o$ respectively.

The fixation method requires a sequence of two fixated images in which the fixation
point stays stationary, $\mathbf{r}_{ot} = \mathbf{0}$. A fixated image sequence can be obtained by first
finding $u_o$ and $v_o$, and then using these components to construct a new image, the
fixated 2nd image. The technique for the construction of the fixated 2nd image (pixel
shifting process) is explained in chapter 5.

We also saw that the component of the rotational velocity along the fixation axis,
$\omega_{R_o}$, cannot be obtained from the fixation formulation because this component does
not move the fixation point.

In this chapter, we will introduce an algorithm for obtaining not only the rotation
$\omega_{R_o}$ but also the components of the fixation velocity, $u_o$ and $v_o$.

Figure 4-1: In general, for any point chosen as the fixation point, there is an associated
apparent velocity (fixation velocity), and a rotational component along the fixation
axis, $\omega_{R_o}$. The components of fixation velocity are shown by $(u_o, v_o)$.

4.1  Algorithm 

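The motion field velocity due to the rotational velocity component $\omega_{R_o}$ is given by
$-(\omega_{R_o}\hat{\mathbf{R}}_o \times \mathbf{r}) = -\omega_{R_o}(\hat{\mathbf{R}}_o \times \mathbf{r}) = -\tilde{\omega}_{R_o}(\mathbf{r}_o \times \mathbf{r})$, where $\hat{\mathbf{R}}_o = \hat{\mathbf{r}}_o$ is the unit vector
along the fixation axis $\mathbf{r}_o$. Considering a small patch around the fixation point, and
substituting $\mathbf{r}_o = (x_o\ y_o\ 1)^T$ and $\mathbf{r} = (x\ y\ 1)^T$, the components of the total motion
field velocity due to the fixation velocity and $\omega_{R_o}$ are given by

$$\begin{cases} x_t = u_o - \tilde{\omega}_{R_o}\,\hat{\mathbf{x}} \cdot (\mathbf{r}_o \times \mathbf{r}) = u_o + \tilde{\omega}_{R_o}(y - y_o) \\ y_t = v_o - \tilde{\omega}_{R_o}\,\hat{\mathbf{y}} \cdot (\mathbf{r}_o \times \mathbf{r}) = v_o - \tilde{\omega}_{R_o}(x - x_o) \end{cases} \tag{4.1}$$

where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are the unit vectors along the x and y axes and $\tilde{\omega}_{R_o}$ is a notation for
$\omega_{R_o}/\|\mathbf{r}_o\|$. Substituting for $x_t$ and $y_t$ from the above equations into the BCCE, eqn. 2.7,
gives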

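$$[u_o + \tilde{\omega}_{R_o}(y - y_o)]E_x + [v_o - \tilde{\omega}_{R_o}(x - x_o)]E_y + E_t = 0. \tag{4.2}$$

Due to noise, eqn. 4.2 does not necessarily hold for any point $(x, y)$. Thus, we try
to find $u_o$, $v_o$ and $\tilde{\omega}_{R_o}$ by minimizing the sum of squares of errors over the fixation
patch. In other words we want to minimize

$$\iint [(u_o + \tilde{\omega}_{R_o}(y - y_o))E_x + (v_o - \tilde{\omega}_{R_o}(x - x_o))E_y + E_t]^2\,dx\,dy \tag{4.3}$$

with respect to $u_o$, $v_o$ and $\tilde{\omega}_{R_o}$. This results in a system of three linear equations
that can be solved for the three unknowns

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} u_o \\ v_o \\ \tilde{\omega}_{R_o} \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix}. \tag{4.4}$$

Matrix $A$ is symmetric and its elements are given by

$$\begin{cases} a_{11} = \iint E_x^2\,dx\,dy \\ a_{22} = \iint E_y^2\,dx\,dy \\ a_{33} = \iint [E_x(y - y_o) - E_y(x - x_o)]^2\,dx\,dy \\ a_{12} = \iint E_x E_y\,dx\,dy \\ a_{13} = \iint E_x[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy \\ a_{23} = \iint E_y[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy \end{cases} \tag{4.5}$$

and the components of vector $C$ are as follows:

$$\begin{cases} c_1 = -\iint E_t E_x\,dx\,dy \\ c_2 = -\iint E_t E_y\,dx\,dy \\ c_3 = -\iint E_t[E_x(y - y_o) - E_y(x - x_o)]\,dx\,dy. \end{cases} \tag{4.6}$$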

Considering that the fixation point coordinates $x_o$ and $y_o$ are known, the sets of
equations 4.5 and 4.6 show that the elements of matrix $A$ and the components of
vector $C$ are fully computable.
4.2  Discussion

When the spatio-temporal gradients are zero, matrix $A$ is singular (not invertible) because all of its
elements are zero. As a result, we will not be able to compute the motion components
in such a case. Chapter 10 explains how to avoid this by an autonomous choice of
an appropriate fixation point that is not located in a patch with uniform brightness.
Furthermore, for implementation we make sure that the determinant of matrix $A$ is
nonzero before advancing into the computations.
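The least-squares system of eqn. 4.4 and the singularity check just mentioned can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the integrals are replaced by sums over patch pixels, the gradients are assumed to be precomputed, and the names `fixation_velocity` and `solve3` as well as the synthetic data are assumptions made for the example.

```python
def solve3(A, c, tol=1e-12):
    """Solve a 3x3 linear system by Gaussian elimination with partial
    pivoting; raises ValueError when A is (numerically) singular."""
    M = [list(A[i]) + [c[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < tol:
            raise ValueError("matrix A is singular")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][k] * x[k] for k in range(i + 1, 3))) / M[i][i]
    return x


def fixation_velocity(pixels, x_o, y_o):
    """Estimate (u_o, v_o, omega_tilde) from samples (x, y, Ex, Ey, Et)
    by building A and C as discrete sums and solving A p = C."""
    A = [[0.0] * 3 for _ in range(3)]
    c = [0.0] * 3
    for x, y, Ex, Ey, Et in pixels:
        g = Ex * (y - y_o) - Ey * (x - x_o)  # coefficient of omega_tilde
        row = (Ex, Ey, g)
        for i in range(3):
            c[i] -= Et * row[i]
            for j in range(3):
                A[i][j] += row[i] * row[j]
    return solve3(A, c)


# Synthetic, noise-free data (assumed values) generated from known motion.
true_u, true_v, true_w = 0.02, -0.01, 0.05
x_o, y_o = 0.1, -0.2
grads = [(0.0, 0.0, 1.0, 0.5), (0.2, 0.1, -0.3, 0.8),
         (-0.1, 0.4, 0.6, -0.4), (0.3, -0.2, 0.2, 0.9)]
pixels = []
for x, y, Ex, Ey in grads:
    g = Ex * (y - y_o) - Ey * (x - x_o)
    Et = -(true_u * Ex + true_v * Ey + true_w * g)  # satisfies eqn. 4.2
    pixels.append((x, y, Ex, Ey, Et))

u_o, v_o, w_tilde = fixation_velocity(pixels, x_o, y_o)
```

On a patch with uniform brightness every gradient is zero, so `solve3` raises instead of returning meaningless values, mirroring the determinant check described above.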

In the special case where the fixation point is at the principal point, $x_o = y_o = 0$,
elements of matrix $A$ are simplified as

$$\begin{cases} a_{11} = \iint E_x^2\,dx\,dy \\ a_{22} = \iint E_y^2\,dx\,dy \\ a_{33} = \iint (yE_x - xE_y)^2\,dx\,dy \\ a_{12} = \iint E_x E_y\,dx\,dy \\ a_{13} = \iint E_x(yE_x - xE_y)\,dx\,dy \\ a_{23} = \iint E_y(yE_x - xE_y)\,dx\,dy \end{cases} \tag{4.7}$$

and components of vector $C$ are given as follows

$$\begin{cases} c_1 = -\iint E_t E_x\,dx\,dy \\ c_2 = -\iint E_t E_y\,dx\,dy \\ c_3 = -\iint E_t(yE_x - xE_y)\,dx\,dy. \end{cases} \tag{4.8}$$

After finding $\tilde{\omega}_{R_o}$ we can easily compute $\omega_{R_o} = \tilde{\omega}_{R_o}\sqrt{x_o^2 + y_o^2 + 1}$. Clearly, when
the fixation point is at the principal point, $\tilde{\omega}_{R_o}$ becomes equal to $\omega_{R_o}$.

The algorithm given in this chapter has been successfully implemented on real
images and good estimates have been obtained for the fixation velocity components
and $\omega_{R_o}$. Chapter 8 describes the implementation results.


Chapter 5

Constructing a Sequence of Fixated Images


The  fixation  method  requires  a  sequence  of  two  images  in  which  the  fixation  point 
is  kept  stationary.  However,  the  input  can  be  an  arbitrary  sequence  of  two  images 
that  we  shall  call  the  1st  initial  and  2nd  initial  images.  The  1st  initial  image  is  used 
directly  as  the  1st  fixated  image  but  we  need  to  find  a  2nd  fixated  image  using  the 
2nd  initial  image. 

Physical  rotation  of  the  camera  relative  to  the  observer  base  is  a  hardware  solution 
to this problem which is basically a tracking problem. Considering that in general
the  interest  point  has  a  motion  relative  to  the  observer,  the  2nd  fixated  image  cannot 
be  obtained  in  one  step.  As  a  result,  a  feedback  control  loop  is  required  for  the 
camera  rotation  system  to  compensate  for  the  errors  resulting  from  the  new  position 
of  the  fixation  point.  This  tracking  approach  is  to  be  avoided  not  only  because  of  the 
potential  errors  involved  but  also  because  of  concern  about  real  time  applications. 

In this chapter, we will show how a 2nd fixated image can be constructed by a
purely software technique, the pixel shifting process. It involves applying an imaginary
rotation to the vision system and determining the corresponding transformation which
affects the 2nd initial image.

5.1  Equivalent Rotational Velocity

If point 1 is chosen as the fixation point in the 1st initial image, then in general its
corresponding image point in the 2nd initial image moves to a new location such as
point 2; see fig. 5-1.


Figure 5-1: An imaginary rotation opposite to the equivalent rotational velocity, $-\boldsymbol{\Omega}$,
is applied to the vision system to bring point 2 to point 1. This rotation transforms
the 2nd initial image into the 2nd fixated image.

Determining  the  location  of  point  2  is  equivalent  to  the  estimation  of  the  fixation 
velocity.  Chapter  4  introduced  a  technique  for  the  estimation  of  the  fixation  velocity. 
The  experimental  results  in  chapters  8  and  9  will  also  show  that  the  fixation  velocity 
can  be  estimated  reliably  even  from  real  and  noisy  images.  As  a  result,  it  is  assumed 
here  that  the  fixation  velocity  has  been  already  computed  from  eqn.  4.4. 

There are infinitely many combinations of translations and rotations which can be
applied to the vision system or camera to bring the image point at 2 to the location 1.
Among all these combinations, we choose to accomplish the task by a pure rotation.
To find the desired rotation, we first introduce an equivalent rotational velocity,
$\boldsymbol{\Omega} = (\Omega_x\ \Omega_y\ \Omega_z)^T$, as a rotation which can result in the same fixation velocity $(u_o, v_o)$
at the fixation point $(x_o, y_o)$. According to eqn. 2.6, the components of $\boldsymbol{\Omega}$ must satisfy
the following set of equations

$$\begin{cases} u_o = x_o y_o\Omega_x - (x_o^2 + 1)\Omega_y + y_o\Omega_z \\ v_o = (y_o^2 + 1)\Omega_x - x_o y_o\Omega_y - x_o\Omega_z. \end{cases} \tag{5.1}$$

There are also infinitely many configurations of $\boldsymbol{\Omega}$ that satisfy the system of equations in 5.1.
However, we choose the only one that does not introduce any new rotational velocity
along the fixation axis $\mathbf{r}_o$. Mathematically it is equivalent to having $\boldsymbol{\Omega} \cdot \mathbf{r}_o = 0$ which
results in an extra constraint on the components of $\boldsymbol{\Omega}$,

$$x_o\Omega_x + y_o\Omega_y + \Omega_z = 0. \tag{5.2}$$

This constraint guarantees that the value of $\omega_{R_o}$ obtained by applying the system of
equations 4.4 to the two initial images is also valid for the fixated images. As a result,
no adjustment in $\omega_{R_o}$ is needed before using it in equations 3.35 and 3.36 which must
be applied to a sequence of fixated images.

Considering that the fixation velocity $(u_o, v_o)$ and the fixation point coordinates
$x_o$ and $y_o$ are known here, the equivalent rotational velocity $\boldsymbol{\Omega}$ is obtained by solving
the combination of three linear equations in 5.1 and 5.2. For example, in the case
that the fixation point is at the principal point, $x_o = y_o = 0$, the equivalent rotational
velocity becomes

$$\boldsymbol{\Omega} = (v_o, -u_o, 0)^T. \tag{5.3}$$

However, it should be emphasized that the fixation point is not restricted to the principal
point and virtually any point can be chosen as the fixation point.
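For a general fixation point, the three linear equations in 5.1 and 5.2 can be solved in closed form. The sketch below is an illustration rather than the thesis code: eliminating $\Omega_z = -x_o\Omega_x - y_o\Omega_y$ from eqn. 5.1 leaves $u_o = -(1 + x_o^2 + y_o^2)\Omega_y$ and $v_o = (1 + x_o^2 + y_o^2)\Omega_x$, a step worked out here and not stated in the text. The function names and sample values are assumptions.

```python
def equivalent_rotation(u_o, v_o, x_o, y_o):
    """Solve eqns. 5.1 and 5.2 for (Omega_x, Omega_y, Omega_z).
    Closed form obtained by substituting the constraint of eqn. 5.2
    into eqn. 5.1 (derived for this sketch)."""
    d = 1.0 + x_o * x_o + y_o * y_o
    omega_x = v_o / d
    omega_y = -u_o / d
    omega_z = (y_o * u_o - x_o * v_o) / d
    return omega_x, omega_y, omega_z


def velocity_from_rotation(omega, x_o, y_o):
    """Right-hand sides of eqn. 5.1, used here only to verify the solution."""
    ox, oy, oz = omega
    u = x_o * y_o * ox - (x_o * x_o + 1.0) * oy + y_o * oz
    v = (y_o * y_o + 1.0) * ox - x_o * y_o * oy - x_o * oz
    return u, v


# Sample fixation velocity and fixation point (assumed values).
Omega = equivalent_rotation(0.03, -0.02, 0.2, -0.1)
u_chk, v_chk = velocity_from_rotation(Omega, 0.2, -0.1)
axis_component = 0.2 * Omega[0] + (-0.1) * Omega[1] + Omega[2]

# At the principal point the result reduces to (v_o, -u_o, 0).
Omega_pp = equivalent_rotation(0.03, -0.02, 0.0, 0.0)
```

Because the denominator $1 + x_o^2 + y_o^2$ is never zero, the system always has a unique solution, consistent with the remark that virtually any point can serve as the fixation point.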




5.2  Constructing  the  2nd  Fixated  Image 

After obtaining the equivalent rotational velocity $\boldsymbol{\Omega}$, the task of constructing the
2nd fixated image is equivalent to finding the transformation experienced by the 2nd
initial image when the imaginary rotation $-\boldsymbol{\Omega}$ is applied to the vision system.

Considering eqn. 5.1, the following set of equations gives the components of the
corresponding shifting vector $(u, v)$ for any pixel $(x, y)$ of the 2nd initial image

$$\begin{cases} u = -xy\,\Omega_x + (x^2 + 1)\Omega_y - y\,\Omega_z \\ v = -(y^2 + 1)\Omega_x + xy\,\Omega_y + x\,\Omega_z. \end{cases} \tag{5.4}$$

Here $\Omega_x$, $\Omega_y$ and $\Omega_z$ are known values. As a result, the shifting vector $(u, v)$ can be
obtained for every pixel of the 2nd initial image.

Figure 5-2: The pixel shifting process for constructing the fixated 2nd image from the
2nd initial image (panels: 2nd initial image, 2nd fixated image).

Figure 5-2 shows the process of constructing the 2nd fixated image using the 2nd
initial image, called the pixel shifting process. The brightness at pixel $(x, y)$ of the 2nd
fixated image is the same as the brightness at the corresponding point $(x - Tu,\ y - Tv)$
in the 2nd initial image, where $T$ is the time interval between two initial images. In
general, a computed original point is not located at the center of a pixel in the 2nd
initial image. As a result, its brightness cannot be read directly from the image file
and should be computed by averaging, bilinear interpolation or bicubic interpolation
of the brightnesses at its neighboring pixels.
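The pixel shifting process can be sketched with bilinear interpolation as follows. This is a minimal illustration under several assumptions that are not in the text: the image is a list of rows of brightness values (at least 2×2), image coordinates are obtained from pixel indices through a focal length `f` in pixels and a principal point `(cx, cy)`, the y axis follows the row index, and source points falling outside the image receive a `fill` value. All function and parameter names are hypothetical.

```python
import math


def shift_vector(x, y, omega):
    """Shifting vector (u, v) of eqn. 5.4 for the rotation -Omega."""
    ox, oy, oz = omega
    u = -x * y * ox + (x * x + 1.0) * oy - y * oz
    v = -(y * y + 1.0) * ox + x * y * oy + x * oz
    return u, v


def bilinear(img, col, row, fill=0.0):
    """Bilinearly interpolated brightness at fractional pixel (col, row)."""
    h, w = len(img), len(img[0])
    eps = 1e-9  # tolerate rounding at the image border
    if col < -eps or col > w - 1 + eps or row < -eps or row > h - 1 + eps:
        return fill
    col = min(max(col, 0.0), w - 1.0)
    row = min(max(row, 0.0), h - 1.0)
    c0 = min(int(math.floor(col)), w - 2)
    r0 = min(int(math.floor(row)), h - 2)
    a, b = col - c0, row - r0
    top = (1.0 - a) * img[r0][c0] + a * img[r0][c0 + 1]
    bot = (1.0 - a) * img[r0 + 1][c0] + a * img[r0 + 1][c0 + 1]
    return (1.0 - b) * top + b * bot


def fixate_second_image(img2, omega, T, f, cx, cy, fill=0.0):
    """Build the 2nd fixated image: each output pixel (x, y) takes the
    brightness of the 2nd initial image at (x - T*u, y - T*v)."""
    h, w = len(img2), len(img2[0])
    out = [[fill] * w for _ in range(h)]
    for row in range(h):
        for col in range(w):
            x = (col - cx) / f  # image coordinates, focal length 1
            y = (row - cy) / f
            u, v = shift_vector(x, y, omega)
            src_col = cx + f * (x - T * u)
            src_row = cy + f * (y - T * v)
            out[row][col] = bilinear(img2, src_col, src_row, fill)
    return out


# With zero equivalent rotation the image must be returned unchanged.
img = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
same = fixate_second_image(img, (0.0, 0.0, 0.0), T=1.0, f=1.0, cx=1.0, cy=1.0)
center = bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
```

Bilinear interpolation is one of the three options named above; averaging or bicubic interpolation could be substituted inside `bilinear` without changing the rest of the sketch.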

It should be clear by now that we neither require the fixated images to be
provided in advance nor do we use mechanical tracking for obtaining the fixated images.
Construction of the 2nd fixated image is based on the pixel shifting process. This is
done entirely in software and no tracking is involved in this technique. In chapter 11,
we will show the results of implementing this purely software based technique for
constructing a sequence of fixated images for several real image sequences.


Chapter 6

An Overview of the Fixation Method


The  algorithms  and  formulations  presented  in  the  previous  chapters  show  how  to 
solve  directly  for  the  motion  and  shape  in  the  general  case.  In  contrast  to  previous 
work  done  in  the  area  of  motion  vision,  our  technique  is  general  and  does  not  put 
any  severe  restrictions  on  the  motion  or  the  environment.  More  importantly,  the 
fixation  method  uses  neither  optical  flow  nor  feature  correspondence.  Instead,  image 
information  such  as  temporal  and  spatial  brightness  gradients  are  used  directly.  This 
method  neither  requires  tracked  images  as  input  nor  uses  tracking  for  obtaining  fixated 
images.  Instead,  it  introduces  a  pixel  shifting  process  for  constructing  fixated  images 
at  any  arbitrary  fixation  point.  This  process  is  done  entirely  in  software  without 
moving  the  camera  for  tracking. 

In  the  previous  chapters,  we  gave  the  theory  underlying  the  fixation  method  in 
detail.  This  chapter  presents  a  summary  of  the  main  steps  involved  in  the  fixation 
method. 

6.1  Main  Modules 

Figure 6-1 shows a block diagram of the ideas behind our fixation based motion
vision system. Referring to this figure, the fixation method can be implemented in
the following steps:

•  Step 1: Finding the fixation velocity components $(u_o, v_o)$ and the component of
rotational velocity along $\mathbf{R}_o$, $\omega_{R_o}$, by applying the system of eqn. 4.4 to the brightness
gradients from two initial images.

•  Step 2: Knowing the fixation velocity components $(u_o, v_o)$, the 2nd fixated image
is constructed by the pixel shifting process explained in chapter 5. This is done entirely
in software without any need to move the camera for tracking. This step also results
in the estimation of the equivalent rotational velocity $\boldsymbol{\Omega}$.

•  Step 3: Knowing $\omega_{R_o}$, and using the fixation constraint equation 3.15, the 1st
initial image, and the 2nd fixated image, the method presented in chapter 3 can be
used for recovering the translational velocity $\mathbf{t}$, the partial rotational velocity $\boldsymbol{\omega}$, and
the depth $Z$ at all image points.

•  Step 4: The total rotational velocity $\boldsymbol{\omega}_{tot}$ is obtained simply by adding the equivalent
rotational velocity $\boldsymbol{\Omega}$ from equations 5.1 and 5.2 to the partial rotational velocity
$\boldsymbol{\omega}$ from eqn. 3.15.

Figure 6-1: The modules of the fixation based general motion vision system.

In  the  following  chapters,  we  apply  our  fixation  based  motion  vision  system  to 
the real world environment to recover motion and shape in the general case. At
every  step,  we  discuss  the  implementation  issues  and  introduce  practical  techniques 
for  dealing  with  them. 


Chapter 7

Spatial and Temporal Brightness Gradients


Brightness gradients are the primary source of information for direct method
algorithms. Appendix B describes the formulations for obtaining spatial brightness
gradients $E_x$ and $E_y$, and the temporal brightness gradient $E_t$ from a sequence of two
time varying images.

This  chapter  applies  those  formulations  to  two  real  image  sequences  to  obtain  the 
corresponding  brightness  gradients.  Then,  we  will  introduce  a  technique  for  the  visual 
representation of the brightness gradients and finally, we will study those
representations to explain the significance and characteristics of brightness gradients.

7.1  Visual  Representation 

Two successive frames of the landscape image sequence (taken at the Imaging
Laboratory of Carnegie Mellon University) are shown in fig. 7-1. These are 8-bit images
but the two least significant bits are usually too noisy to be reliable.

The  true  motion  between  these  frames  is  a  combination  of  translation  and  rotation. 
The  real  rotation  is  0.3  deg  about  the  optical  axis  Z  and  the  real  translation  is  2  mm 
along the horizontal axis X.

Using  the  formulation  in  appendix  B,  we  can  compute  the  brightness  gradients. 
The  corresponding  spatial  and  temporal  brightness  gradients  for  the  landscape  image 
sequence  are  shown  in  figures  7-2  and  7-3,  respectively. 

Figure 7-4 shows another image sequence (cup image sequence) used in the
experiments. The motion between these successive frames is a 3D translation of
(2.5, 0, 4) mm. The spatial and temporal gradients for the cup image sequence are
shown in figures 7-5 and 7-6, respectively.

In  these  maps,  larger  gradient  values  are  shown  brighter.  Such  gradient  maps 
suggest  a  way  of  visually  representing  the  brightness  gradients  which  renders  them 
more  intuitively  meaningful. 
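
As a rough illustration of such a visual representation, the sketch below linearly rescales a gradient map to 8-bit gray levels so that larger values appear brighter. The linear min-to-max mapping and the name `to_gray_map` are assumptions made for this example; the text does not state the exact scaling used for the figures.

```python
def to_gray_map(gradient):
    """Linearly rescale a 2D gradient map (list of rows) to gray levels 0..255.

    Assumption: the smallest value maps to black and the largest to white.
    """
    flat = [value for row in gradient for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant map carries no contrast; render it as uniform black.
        return [[0 for _ in row] for row in gradient]
    scale = 255.0 / (high - low)
    # Adding 0.5 before truncation rounds to the nearest gray level.
    return [[int((value - low) * scale + 0.5) for value in row] for row in gradient]


# Tiny hypothetical gradient map, used only to show the mapping.
gray = to_gray_map([[-1.0, 0.0], [1.0, 3.0]])
```

A signed mapping like this keeps the sign of the gradient visible; mapping the absolute value instead would highlight edge strength regardless of polarity.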


7.2  Interpretation  and  Significance 

The top gradient maps in figures 7-2 and 7-5 show that horizontal gradients (Ex's)
capture the vertical lines and features in the images. Similarly, the bottom gradient
maps in these figures demonstrate that vertical gradients (Ey's) pick up the horizontal
lines and features in the image.

These experimental results show that the spatial gradients capture the geometric
and shading characteristics of the images. It is important to notice that the
computation behind spatial gradients is very simple. However, they indirectly capture the
edges, features, and boundaries of the scene.

The  temporal  brightness  gradient  in  fig.  7-3  tells  us  about  the  motion  between 
two  landscape  images.  First  of  all,  the  vertical  lines  and  features  are  seen  all  over  this 
temporal  gradient  map.  This  observation  indicates  that  the  motion  has  a  horizontal 
translation  component. 

Secondly, there are also horizontal lines in this gradient map but they become
weaker as they get close to the left side of the map (this argument becomes more
obvious if one compares the horizontal lines here with those of Ey in fig. 7-2). This
means that the motion has a rotational component which is centered on the left side of
the image. In section 13.2, we will show that this is really the case.

Also, we can observe that at any vertical stripe of the spatial gradient map,
the horizontal lines become stronger as their distance from the center of the stripe
increases. This observation indicates that the rotation center is located in the middle
of the image.

Figure  7-6  shows  that  the  temporal  brightness  gradient  map  captures  the  vertical 
edges  and  features  in  the  cup  image  sequence.  The  uniform  strength  of  the  vertical 
lines  in  fig.  7-6  is  an  indication  of  the  fact  that  the  motion  in  the  cup  image  sequence 
is  a  pure  horizontal  translation. 

7.3  Summary 

The  gradient  maps  and  discussions  presented  in  this  chapter  show  that  the  spatial 
gradients  capture  the  geometric  and  shading  characteristics  of  the  images  and  the 
temporal  gradients  contain  important  information  about  the  motion. 

As  shown  in  appendix  B,  the  computational  procedure  behind  gradient  estimation 
is  very  simple.  In  fact,  it  only  involves  the  subtraction  of  neighboring  pixel  values. 
Such  a  simple  computation  indirectly  results  in  capturing  the  motion  and  detecting 
the  features,  edges,  and  boundaries  in  the  images. 
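
To make the "subtraction of neighboring pixel values" concrete, here is a minimal sketch of one common finite-difference scheme, in which each gradient is the average of four first differences over a 2 x 2 x 2 cube of pixels. Appendix B is not reproduced in this chapter, so the exact stencil, the function name `brightness_gradients`, and the row/column convention (rows index y, columns index x) are assumptions of this example rather than the formulation used in the experiments.

```python
def brightness_gradients(frame1, frame2):
    """Estimate Ex, Ey, Et from two equally sized gray-level frames.

    Each estimate averages four first differences over a 2 x 2 x 2 cube of
    pixels (two rows, two columns, two frames). Rows index y, columns index x.
    """
    rows, cols = len(frame1), len(frame1[0])
    ex = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    ey = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    et = [[0.0] * (cols - 1) for _ in range(rows - 1)]
    for i in range(rows - 1):
        for j in range(cols - 1):
            a00, a01 = frame1[i][j], frame1[i][j + 1]
            a10, a11 = frame1[i + 1][j], frame1[i + 1][j + 1]
            b00, b01 = frame2[i][j], frame2[i][j + 1]
            b10, b11 = frame2[i + 1][j], frame2[i + 1][j + 1]
            # Differences along x (columns), y (rows), and time (frames).
            ex[i][j] = 0.25 * ((a01 - a00) + (a11 - a10) + (b01 - b00) + (b11 - b10))
            ey[i][j] = 0.25 * ((a10 - a00) + (a11 - a01) + (b10 - b00) + (b11 - b01))
            et[i][j] = 0.25 * ((b00 - a00) + (b01 - a01) + (b10 - a10) + (b11 - a11))
    return ex, ey, et


# Synthetic check: a brightness ramp of slope 2 in x and 3 in y that brightens by 5.
first = [[2 * j + 3 * i for j in range(4)] for i in range(3)]
second = [[value + 5 for value in row] for row in first]
ex, ey, et = brightness_gradients(first, second)
```

On the synthetic ramp every entry of `ex`, `ey`, and `et` equals the slope that was built in, which is a quick way to confirm the axis convention before applying the function to real frames.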

However,  we  should  emphasize  that  we  neither  intended  to  obtain  such  edges  and 
features  nor  did  we  use  such  representation  of  the  gradient  maps  in  our  algorithms. 
The  intention  was  to  demonstrate  that  the  brightness  gradient  maps  not  only  contain 
the  motion  information  (which  is  usually  represented  by  the  optical  flow  maps)  but 
also  have  a  flavor  of  features  and  edges  (used  in  edge  maps  and  feature  correspondence 
algorithms). 

Figure 7-1: The first and second frames in the landscape image sequence. The true
motion is a 0.3 deg rotation about the nominal optical axis Z, and a 2 mm translation
along the horizontal axis X.


Figure 7-2: The visual representation of the spatial brightness gradients for the
landscape image sequence in the horizontal direction (top) and vertical direction (bottom),
Ex and Ey. The horizontal gradient map (top) has captured the vertical edges and
features in the image. Similarly, the vertical gradient map (bottom) has picked up
the horizontal edges and features.


Figure 7-3: The visual representation of the temporal brightness gradient for the
landscape image sequence, Et. The vertical edges with relatively uniform strength
suggest that the motion has a horizontal translation component. The horizontal edges
with decreasing strength towards the left indicate that there is also a rotation centered
at the left of the image center.


Figure 7-4: The first and second images in the cup image sequence. The true motion
between these frames is a 3D translation of (2.5, 0, 4) mm.

Figure 7-5: The visual representation of the spatial brightness gradients for the cup
images in the horizontal direction (top) and vertical direction (bottom), Ex and Ey.
The horizontal gradient map (top) has captured the vertical edges and features in the
image. Similarly, the vertical gradient map (bottom) has picked up the horizontal
edges and features.

Figure 7-6: The visual representation of the temporal brightness gradient for the cup
image sequence, Et. The presence of relatively uniform vertical edges and features
here indicates that the motion is predominantly a horizontal translation.


Chapter 8

The Effect of Fixation Patch Size


Finding the fixation velocity (velocity at the fixation point), and the component
of rotational velocity about the fixation axis, ω_Ro, is the most important part of
our fixation based method for recovering the shape and motion from an arbitrary
sequence of input images. This is because in our method a pixel shifting process
uses the fixation velocity to construct a sequence of fixated images from an arbitrary
sequence of input images (chapter 5). We also need ω_Ro for computing the total
rotational velocity (chapter 3).

In chapter 4 we introduced the algorithms for recovering the fixation velocity and
ω_Ro using the information from the fixation patch (an image patch around the fixation
point). In this chapter, we study the effect of the fixation patch size on the estimation
of the desired motion parameters using two different sequences of images where the
motion is a combination of translation and rotation.

8.1  Images  with  Moderate  Relative  Depth  Changes 

Here, we have used a sequence of real images acquired at the Imaging Laboratory of
Carnegie Mellon University. Figure 7-1 shows two of these 576 x 384 pixel images.
The relative depth is moderate (1250 mm to 1625 mm, about 30% change) in the
image portion used in our computations. The camera has a nominal focal length of
24 mm, and a pixel size of 0.02 x 0.02 mm. The calibrated principal point has been
used as a fixation point. The calibration technique is explained in section 13.1.
In a raster format system (origin at the top left corner of the image), the calibrated
principal point is located near the center of the image, pixel (275, 205). The frontal depth
of this point is about 1450 mm.

The real motion between these two images has both translational and rotational
components. The real rotation is -0.3 deg about the optical axis Z and the real
translation is -2 mm along the horizontal axis X. Testing our algorithms using
such real images is valuable because the observed motion is relatively large (more
than subpixel motion in the image plane). For very large motions it is enough to
use higher frame grabbing rates. These days, there are commercially available frame
grabbers which are capable of capturing up to 7,500 frames per second at 12 bits gray
scale resolution on personal computers [82].

Using  the  algorithm  described  in  chapter  4  we  can  find  the  horizontal  and  vertical 
translations  and  the  rotational  component  for  any  given  fixation  patch  size.  The 
corresponding  plots  are  shown  in  figures  8-1,  8-2  and  8-3.  It  is  evident  that  these 
estimations  strongly  depend  on  the  fixation  patch  size  especially  when  the  fixation 
patch  is  small.  Figure  8-1  shows  that  the  horizontal  translation  converges  to  its  real 
value (-2 mm). On the other hand, the vertical translation (fig. 8-2) converges to
0.9  mm  which  is  not  its  true  value.  The  reason  for  this  disparity  is  described  in 
section  13.2. 

Figure 8-3 shows that for small patch sizes (less than 30 x 30 pixels in this case) the
estimated value for ω_Ro oscillates wildly and results in unacceptable values. As the
patch size increases, the estimated ω_Ro converges towards the real value of rotation.
For large patch sizes (around 100 x 100 pixels in this case) the estimated rotation,
-0.309 deg, becomes roughly the same as the real rotation, -0.3 deg.

Figure 8-1: The estimated value for the horizontal translation versus the fixation
patch size for the landscape image sequence. The true horizontal translation is
-2 mm.

8.2  Images  with  Significant  Relative  Depth  Changes 

In this section we will study another image sequence (cup images) which has
considerable relative depth changes within the fixation patch (584 mm to 914 mm, about
60% difference). Figure 7-4 shows two of these 227 x 280 pixel images (cup images).

The  real  motion  of  the  viewer  is  a  horizontal  translation  of  2.5  mm  to  the  right. 

The  camera  has  a  nominal  focal  length  of  18.66  mm,  pixel-width  of  0.032  mm,  and 
pixel-height  of  0.029  mm.  We  have  used  the  nominal  principal  point  (image  center) 
as  our  fixation  point. 

Figure 8-4 shows the estimates for the horizontal translation, vertical translation,
and the rotational velocity component ω_Ro. It is obvious that the estimated values
depend strongly on the size of the fixation patch. We can find good estimates for
these motion parameters if we use the right fixation patch size.

Figure 8-2: The estimated value for the vertical translation versus the fixation patch
size for the landscape image sequence. The true vertical translation is zero, which is
apparently different from the experimental results (about -0.9 mm). In chapter 13,
we will show that this considerable difference is due to a calibration problem.

8.3 Finding a Good Estimate for ω_Ro Autonomously

It can be seen that the size of the fixation patch has a critical effect on the estimated
values of the component of rotational velocity about the fixation axis, ω_Ro. A small
patch size results in a value for ω_Ro which is usually far from the true value.
This is possibly because in a small patch, small translations can be interpreted as
large rotations. Figure 8-5 shows a hypothetical situation where (a) and (b) are a
sequence of a small 3 x 3 pixel patch. The real motion in this case is most likely
a one-pixel vertical translation. But if we try to interpret it as a rotation about
the patch center we will end up with a 45 deg rotation which is not acceptable,
considering the assumed small motion between images.
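
The 45 deg figure follows from simple geometry, and the same geometry shows why larger patches are less prone to this misinterpretation. The sketch below is an illustration only: it assumes the shift is observed at a pixel on the patch border, at a distance of half the patch width from the patch center, and the helper name `apparent_rotation_deg` is hypothetical.

```python
import math


def apparent_rotation_deg(patch_size, shift_pixels=1.0):
    """Rotation angle (degrees) about the patch center that would move a
    border pixel by `shift_pixels` perpendicular to its radius.

    Assumption: the border pixel lies (patch_size - 1) / 2 pixels from the center.
    """
    half_width = (patch_size - 1) / 2.0
    return math.degrees(math.atan2(shift_pixels, half_width))


small_patch_angle = apparent_rotation_deg(3)    # 3 x 3 patch, one-pixel shift
large_patch_angle = apparent_rotation_deg(101)  # 101 x 101 patch, same shift
```

For the 3 x 3 patch the border pixel is one pixel from the center, so a one-pixel shift reads as 45 deg; for a 101 x 101 patch the same shift at the border corresponds to little more than one degree.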

In conclusion, we can autonomously find a good estimate for the rotational
velocity component ω_Ro simply by using a relatively large fixation patch size.

Figure 8-3: The estimated value of the component of rotational velocity about the
fixation axis, ω_Ro, versus the fixation patch size for the landscape image sequence. For
large patch sizes, the estimated value of ω_Ro (about -0.309 deg) converges towards
the real value of -0.3 deg.


8.4 Updating the Fixation Velocity Using ω_Ro

In the previous section, we saw that a good estimate for ω_Ro can be found using a
relatively large patch but the corresponding fixation velocity estimate from such a
large patch is usually not reliable. This observation suggests that we may be able to
obtain better estimates for the fixation velocity components if we use the estimated
value of ω_Ro and recompute the fixation velocity.

Figure 8-4: The estimated values for the horizontal and vertical translations and the
rotational component ω_Ro versus the fixation patch size for the cup image sequence.
The true motion is a horizontal translation of 2.5 mm.

Using only the estimate for ω_Ro from a large patch, we can compute the total
motion field at any point (x, y) on a small patch around the fixation point (fixation
patch). As we showed in chapter 4

where (xo, yo) is the position of the fixation point (located in the image plane), and
(uo, vo) is the fixation velocity that we are about to estimate. After substituting xt
and yt into the BCCE, eqn. (2.7), we will have


$$\left[u_o + \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(y - y_o)\right] E_x + \left[v_o - \frac{\omega_{R_o}}{\sqrt{x_o^2 + y_o^2 + 1}}\,(x - x_o)\right] E_y + E_t = 0 \qquad (8.2)$$


However, due to noise, the above equation does not necessarily hold for every pixel. As
a result, we can find uo and vo by minimizing the sum of the errors over the whole
fixation patch, namely by minimizing

with respect to uo and vo. This will result in the following system of linear equations

that can be solved for the two unknowns uo and vo. Note that ω_Ro has already been
computed and is a known value in this equation.

Figure 8-5: Using a small fixation patch can result in a wrong interpretation of a large
rotation. In a patch of 3 x 3 pixels, a one-pixel vertical translation can be interpreted
as a 45 deg rotation, which is not an acceptable answer at all, considering the finite
motion between images.
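
The update described in this section can be sketched as a two-unknown least-squares problem. The sketch below is not the thesis implementation: it assumes the motion field near the fixation point has the form xt = uo + k (y - yo), yt = vo - k (x - xo) with k = ω_Ro / sqrt(xo^2 + yo^2 + 1) (the sign convention is an assumption), that the quantity minimized is the sum of squared BCCE residuals, and that image coordinates are expressed in units of the focal length. The function name `update_fixation_velocity` and the synthetic data are hypothetical.

```python
import math


def update_fixation_velocity(samples, omega_ro, x0, y0):
    """Least-squares estimate of (uo, vo) with omega_ro held fixed.

    samples: iterable of (x, y, Ex, Ey, Et) tuples taken over the fixation patch.
    Assumed model: xt = uo + k*(y - y0), yt = vo - k*(x - x0).
    """
    k = omega_ro / math.sqrt(x0 * x0 + y0 * y0 + 1.0)
    sxx = sxy = syy = sxc = syc = 0.0
    for x, y, ex, ey, et in samples:
        # Part of the BCCE residual that does not depend on uo or vo.
        c = k * ((y - y0) * ex - (x - x0) * ey) + et
        sxx += ex * ex
        sxy += ex * ey
        syy += ey * ey
        sxc += ex * c
        syc += ey * c
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients in the patch are degenerate (aperture problem)")
    # Solve the 2 x 2 normal equations by Cramer's rule.
    u0 = (-sxc * syy + syc * sxy) / det
    v0 = (-syc * sxx + sxc * sxy) / det
    return u0, v0


# Synthetic check with made-up values chosen for easy arithmetic, not realism.
true_u, true_v, omega, x0, y0 = 0.5, -0.25, 0.5, 0.0, 0.0
k = omega / math.sqrt(x0 * x0 + y0 * y0 + 1.0)
samples = []
for x, y, ex, ey in [(0.0, 2.0, 1.0, 0.0), (2.0, 0.0, 0.0, 1.0),
                     (0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 2.0, -1.0)]:
    xt = true_u + k * (y - y0)
    yt = true_v - k * (x - x0)
    samples.append((x, y, ex, ey, -(xt * ex + yt * ey)))  # noise-free Et
u_est, v_est = update_fixation_velocity(samples, omega, x0, y0)
```

With noise-free synthetic gradients the function returns the velocity that generated the data. The determinant check guards against patches whose gradients all point in one direction, where the two components cannot be separated.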

8.4.1 Improved estimations

Here, we use the updated algorithms (which take advantage of a good ω_Ro estimation)
to find estimations for the translational components of the fixation velocity.

Figures 8-6 and 8-7 compare the updated and previous estimations of the horizontal
and vertical translations in the landscape images. These figures show that there
are some improvements in the updated estimations especially for the vertical
translation (fig. 8-7). The improvements in the updated estimations are more pronounced
in the plots corresponding to the cup images (figures 8-8 and 8-9). Note that we have
better improvements where there is the most need for it, namely in the cup images
where the relative depth variation is large compared to the landscape images.

Figure 8-6: The updated and previous estimations of the horizontal translation, along
the X-axis, versus the fixation patch size for the landscape image sequence.

Despite improvements, the dependency of the updated translational components
on the fixation patch size is still quite clear in these figures. However, we can find good
estimates for these motion parameters if we choose the right fixation patch size. In
practice, we do not know the real fixation velocity, and therefore we cannot select an
appropriate fixation patch size by checking the computed values of the translational
components. The next chapter introduces a technique for autonomous choice of an
optimum fixation patch size.

Figure 8-7: The updated and previous estimations of the vertical translation, along
the Y-axis, versus the fixation patch size for the landscape image sequence.

Figure 8-8: The updated and previous estimations of the horizontal translation, along
the X-axis, versus the fixation patch size for the cup image sequence.


Chapter 9

Autonomous Choice of an Optimum Fixation Patch Size


The experimental results and explanations in the previous chapter suggest that
relatively large patch sizes should be used in order to get a good estimate for the
component of the rotation about the fixation axis, ω_Ro. On the other hand, we know
that in general using a very large patch size will result in a wrong estimate for the
fixation velocity because depth variations usually increase as the patch size increases.

Figures  8-1  and  8-4  showed  that  for  any  image  sequence,  there  is  an  optimum 
patch  size  which  results  in  good  estimates  for  the  fixation  velocity  components.  The 
corresponding  optimum  patch  size  is  about  100  x  100  pixels  for  the  landscape  image 
sequence  (fig.  8-1)  and  about  50  x  50  pixels  for  the  cup  image  sequence  (fig.  8-4). 

In  this  chapter,  we  will  describe  an  autonomous  technique  for  finding  the  optimum 
fixation  patch  size  which  results  in  good  estimates  for  the  fixation  velocity  components 
for  any  image  sequence. 

9.1  Normalized  Error 

We showed that for any given size of the fixation patch, we can find the fixation
velocity components, uo and vo. Also the component of the rotational velocity about the
fixation axis, ω_Ro, can be estimated reliably using a relatively large patch. Knowing
these values, the motion field velocity (xt, yt) at any point (x, y) in the image plane is
given by eqn. 8.1. Ideally, for any given image point (x, y) the BCCE, eqn. 2.7, must
be satisfied. However, in practice we are dealing with real images which are noisy and
as a result, the term xt Ex + yt Ey + Et does not usually become zero. This term can
be considered as an error term for the corresponding pixel. In a patch of size p x p
pixels, we can add these error terms to define the normalized error, e, as

$$e = \frac{1}{p^2} \sum \left(x_t E_x + y_t E_y + E_t\right)^2 \qquad (9.1)$$

This  definition  allows  us  to  compare  the  performance  of  different  patch  sizes  by  study¬ 
ing  the  behavior  of  the  normalized  error  e  with  respect  to  the  changes  in  the  patch 
size  p. 
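
A minimal sketch of how such a normalized error could be evaluated for one patch is given below. It rests on two assumptions that are not confirmed by the surviving text: that the error terms are squared and divided by the number of pixels in the patch, and that the motion field near the fixation point is xt = uo + k (y - yo), yt = vo - k (x - xo) with k = ω_Ro / sqrt(xo^2 + yo^2 + 1). The function name `normalized_error` and the sample values are hypothetical.

```python
import math


def normalized_error(samples, u0, v0, omega_ro, x0, y0):
    """Mean squared BCCE residual over a patch.

    samples: list of (x, y, Ex, Ey, Et) tuples, one per pixel of the patch.
    Assumption: normalization divides by the number of pixels (p * p).
    """
    k = omega_ro / math.sqrt(x0 * x0 + y0 * y0 + 1.0)
    total = 0.0
    for x, y, ex, ey, et in samples:
        xt = u0 + k * (y - y0)  # assumed motion field, x component
        yt = v0 - k * (x - x0)  # assumed motion field, y component
        residual = xt * ex + yt * ey + et
        total += residual * residual
    return total / len(samples)


# Made-up samples chosen for easy arithmetic: residuals are 0, 0, 0.5 and 1.
patch = [(0.0, 2.0, 1.0, 0.0, -2.0),
         (2.0, 0.0, 0.0, 1.0, 1.0),
         (0.0, 0.0, 1.0, 1.0, -0.5),
         (2.0, 2.0, 1.0, 1.0, 0.0)]
e = normalized_error(patch, u0=1.0, v0=0.0, omega_ro=0.5, x0=0.0, y0=0.0)
```

Evaluating this quantity for a range of patch sizes, each with its own estimate of (uo, vo), gives the kind of error-versus-size curve discussed in the following sections.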


9.2  Optimum  Patch  Size 

In  this  section,  we  show  how  the  normalized  error  can  be  used  for  finding  an  optimum 
patch  size  which  results  in  good  estimates  for  the  components  of  the  fixation  velocity. 
Any  patch  of  a  real  image  may  include  a  substantial  depth  range.  In  general,  there  are 
two  main  groups  of  images.  In  the  first  group,  there  are  moderate  changes  in  depth 
variation  as  the  patch  size  increases.  The  second  group  represents  images  where  the 
depth  variation  increases  significantly  as  the  patch  size  increases. 

9.2.1  Moderate  changes  in  relative  depth 

Figure 9-1 shows the normalized error versus the fixation patch size for the landscape
image sequence. Although this plot corresponds to a specific image and motion, it
shows one of the two typical representations of the normalized error behavior as the
patch size increases. As shown in this figure, the normalized error first increases with
the patch size, reaches a peak and then dips down.

Figure 9-1: The estimated value of the normalized error e versus the fixation patch size
for the landscape image sequence. The optimum patch size occurs around 100 pixels.

This is because initially for the smallest patch size (3 x 3 pixels) the algorithm
finds the motion estimates that make the BCCE error term (xt Ex + yt Ey + Et) as
small as possible. The algorithm does a good job in minimizing the total of 9 error
terms in this small patch but the motion estimates are usually very bad at this level
because basically there are not enough data available to the algorithm.

In the next level, we have a patch of 5 x 5 pixels which provides more data.
While there is still not enough data for the algorithm to come up with good motion
estimates, it finds parameters which minimize the sum of the BCCE error terms.
However, the algorithm is not usually as successful as it was for the 3 x 3 pixel patch
size because it has to deal with more error terms and this results in a higher normalized
error.

As we increase the patch size, the struggle between providing more data to the
algorithm and satisfying more error terms continues. For relatively small patch sizes,
this results in a higher normalized error. The normalized error increases until it reaches
a peak where the role of extra input data becomes more important than satisfying
more error terms. Then by increasing the patch size, we are providing more data
to the algorithm and this gives a better motion estimate and results in a smaller
normalized error.

After dipping down, the normalized error stays roughly the same in this case,
because the relative depth variation does not change much with the patch size
(fig. 9-1). The optimum patch size in this example occurs around 100 x 100 pixels which
corresponds to the start of the small slope in normalized error, a roughly flat portion
after the first peak. In this example, relative depth changes are moderate (1250 mm
to 1625 mm, about 30% difference) and stay roughly the same as the patch size
increases.


9.2.2  Significant  changes  in  relative  depth 

The normalized error for the cup image sequence is shown in fig. 9-2. As before, the
normalized error first increases and after reaching a peak it dips down and then grows
with the patch size again. This is because in the beginning, insufficient information
results in extremely wrong estimates and this causes the normalized error to increase
with the patch size. As we are providing more and more data to the algorithm, we
obtain better estimates for the motion components and this decreases the normalized
error. If we increase the patch size beyond an optimum size, which occurs at about
50 pixels in this example, the normalized error starts increasing again. In this 50 x 50
pixel patch, we have a considerable amount of relative depth changes (from 584 mm
to 914 mm, about 57% increase). Such significant relative depth variation leads
to larger errors in the fixation velocity estimates which in turn results in a larger
normalized error as p grows.

Figure 9-2: The normalized error versus fixation patch size for the cup image sequence.
The optimum patch size occurs around 50 pixels.


9.3  Autonomous  Choice  of  Optimum  Patch  Size 

As  one  might  expect,  the  optimum  fixation  patch  size  depends  on  the  patch  topology 
and  texture  which  may  vary  from  image  to  image.  However,  the  general  pattern  of 
the  normalized  error  allows  us  to  autonomously  find  an  optimum  fixation  patch  size 
which  gives  good  estimates  for  the  fixation  velocity  components. 

In the case where considerable changes in the relative depth occur with patch size
increase, as in the cup image sequence, the optimum fixation patch size corresponds
to the minimum normalized error that occurs after the peak value of the normalized
error. And in cases where the relative depth does not change significantly with patch
size, as in the landscape image sequence, the optimum fixation patch size is where
the normalized error does not change considerably as the patch size increases.

A  human  operator  may  not  have  much  problem  identifying  the  optimum  patch 
size  on  the  normalized  error  plots.  But  our  aim  is  to  come  up  with  a  simple  algorithm 
which  allows  a  machine  to  autonomously  find  the  optimum  patch  size  from  any  given 
normalized  error  data  set. 

9.3.1 Algorithm


This section describes the algorithm for obtaining the optimum fixation patch size
from any normalized error data set. The general algorithm is composed of the
following steps:

•  Step  1:  Setting  the  patch  size  bounds 

All the experimental results unanimously show that the motion estimates from a
small patch are not reliable at all. As a result, we can put a lower bound on the patch
size. By taking into account the camera parameters and the image size, we have used
a 15 x 15 pixel patch as the lower bound of the patch size. Moreover, the square
shape of the patch, the location of the fixation point, and the image size dictate an
upper bound on the patch size. As a result, we have used 140 x 140 pixels as the
upper bound in our experiments.

•  Step  2:  Computing  the  normalized  error  slope 

Denoting the normalized error at patch i as e[i], we define the slope at patch i as

$$S[i] = \frac{e[i+1] - e[i]}{e[i]} \qquad (9.2)$$

The slope S[i] is dimensionless and shows the relative change of the normalized error
as the patch i changes to patch i + 1.

•  Step  3:  Setting  a  slope  index 

By searching through the slope space, we can find the steepest (most negative)
slope and denote it as Smax. This definition allows the algorithm to get a sense of
steepness (or flatness) at any point on the normalized error curve. We define the slope
index Sind as a small percentage (about 15%) of the steepest slope Smax. Study
of many normalized error plots has shown that this choice of Sind allows us to
identify relatively flat portions in a typical normalized error curve.

• Step 4: Searching for the optimum patch size

We choose the lower bound patch size as the first candidate for the optimum size.
Then, we move to the next patch size and select it as the new nominated optimum
patch size if it satisfies the following two conditions:

- First condition: Its normalized error e[i] should be less than the normalized error
value of the previously nominated optimum point.

- Second condition: Its corresponding slope should be steeper (more negative)
than the slope index, Sind.

We  continue  this  search  process  until  we  reach  the  upper  bound  of  the  patch  size. 

•  Step  5:  Locating  the  optimum  patch  size 

After  checking  all  the  data,  the  point  immediately  after  the  last  nominated  point 
is  selected  as  the  optimum  point. 
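
The five steps above can be sketched as follows. This is an illustrative reading of the description, not the original implementation: it assumes the error list has already been restricted to the patch-size bounds of Step 1, that all errors are positive, and that the slope index is exactly 15% of the steepest slope. The function name `optimum_patch_size` and both synthetic error curves are hypothetical.

```python
def optimum_patch_size(sizes, errors, slope_fraction=0.15):
    """Pick the optimum patch size from a normalized-error curve.

    sizes, errors: equally long lists ordered by increasing patch size,
    already clipped to the lower and upper bounds (Step 1).
    """
    if len(sizes) != len(errors) or len(errors) < 3:
        raise ValueError("need at least three (size, error) pairs")
    # Step 2: relative slope S[i] = (e[i+1] - e[i]) / e[i].
    slopes = [(errors[i + 1] - errors[i]) / errors[i] for i in range(len(errors) - 1)]
    # Step 3: slope index as a fraction of the steepest (most negative) slope.
    s_ind = slope_fraction * min(slopes)
    # Step 4: start at the lower bound and nominate points that are both
    # lower in error than the last nominee and steeper than the slope index.
    nominated = 0
    for i in range(1, len(slopes)):
        if errors[i] < errors[nominated] and slopes[i] < s_ind:
            nominated = i
    # Step 5: the point immediately after the last nominated point.
    return sizes[nominated + 1]


sizes = [15, 25, 35, 45, 55, 65, 75, 85, 95]
# Synthetic curve that rises, dips, then flattens (moderate depth variation).
flat_curve = [1.0, 2.0, 3.0, 2.0, 1.0, 0.6, 0.55, 0.54, 0.54]
# Synthetic curve that rises, dips, then grows again (significant depth variation).
rising_curve = [1.0, 2.0, 3.0, 1.5, 0.8, 0.5, 0.7, 1.0, 1.4]
flat_optimum = optimum_patch_size(sizes, flat_curve)
rising_optimum = optimum_patch_size(sizes, rising_curve)
```

On the first synthetic curve the search stops at the start of the flat portion, and on the second it stops at the minimum after the peak, which matches the two cases described in section 9.3.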


9.3.2  Experimental  results 

The above algorithm has been applied to the normalized error data set of the
landscape and the cup image sequences (figures 9-1 and 9-2) to obtain the optimum patch
sizes. The corresponding experimental results of locating the optimum patch size are
shown in figures 9-3 and 9-4. In these figures, the nominated optimum points are
shown by small circles on the normalized error curves. It can be seen that for both
cases the algorithm finds the optimum points correctly.

Figure 9-3: Searching process of finding the optimum patch size for the landscape
image sequence. The nominated points are shown by small circles. The last point
represents the optimum point which occurs at 101 pixels in this case.

Figure 9-3 shows that the optimum patch size for the landscape image sequence is
selected at 101 pixels which corresponds to a small field of view (about 2 x 2.4 deg).
If we go back to figures 8-6 and 8-7 again, we see that one of the best estimations
for the translational components occurs at this optimum patch size (101 pixels). The
optimum patch size for the cup image sequence is selected at 47 pixels (fig. 9-4).
Similarly, figures 8-8 and 8-9 show that we obtain one of the best combined motion
estimates at this optimum point (47 pixels). This optimum patch size for the cup
image sequence makes approximately the same field of view as the one for the landscape
image sequence (about 2 x 2.4 deg). This is an important observation considering
that we have obtained roughly the same optimum field of view for two totally different
images, cameras, and focal lengths.

9.3.3  Further  results 

In  order  to  test  our  algorithm  further,  we  have  run  it  on  many  other  image  sequences 
with  smaller  and  larger  motions.  The  algorithm  has  worked  successfully  in  finding 
the  optimum  patch  sizes  in  all  cases.  Some  of  the  corresponding  experimental  results 
are shown in figures 9-5, 9-6, 9-7, and 9-8. These experimental results for the other
image sequences show that the corresponding optimum patch sizes are close to, but not
necessarily the same as, the values we obtained before. However, in every case the
obtained optimum point represents the patch size which results in one of the best
motion estimates.

Figure 9-4: Searching process of finding the optimum patch size for the cup image
sequence. The nominated points are shown by small circles. The last point represents
the optimum point which occurs at 47 pixels in this case.

Figure 9-5: Searching process of finding the optimum patch size for the landscape 20-30
image sequence. The motion is two times as large as before (-4 mm translation
and -0.6 deg rotation). The nominated points are shown by small circles. The last
point represents the optimum point which occurs at 101 pixels in this case.

Figure 9-6: Searching process of finding the optimum patch size for the cup 13 image
sequence. The motion is two times as large as before (-5 mm translation). The
nominated points are shown by small circles. The last point represents the optimum
point which occurs at 39 pixels in this case.

Figure 9-7: Searching process of finding the optimum patch size for the landscape 20-25
image sequence. The motion is -2 mm translation and -0.3 deg rotation. The
nominated points are shown by small circles. The last point represents the optimum
point which occurs at 105 pixels in this case.

Figure 9-8: Searching process of finding the optimum patch size for the cup-23 image
sequence. The motion is -2.5 mm translation. The nominated points are shown by
small circles. The last point represents the optimum point which occurs at 49 pixels
in this case.


Chapter 10

Autonomous Choice of an
Appropriate Fixation Point


In  general,  our  fixation  algorithms  do  not  place  any  restrictions  on  the  choice 
of  the  fixation  point  location  and  virtually  any  point  can  be  chosen  as  the  fixation 
point. Among all points, choosing the principal point (image center) makes the
formulations simpler. In practice, however, further considerations should be taken
into account when choosing an appropriate fixation point. Most significantly, the
motion  of  the  chosen  fixation  point  should  be  detectable  using  the  information  from 
its  corresponding  patch.  To  clarify  this,  we  can  consider  a  patch  which  has  a  uniform 
brightness.  Choosing  the  center  of  such  a  patch  as  the  fixation  point  will  not  be 
useful,  because  the  motion  of  such  a  point  is  irrecoverable  using  only  the  information 
from  that  patch.  This  chapter  introduces  a  technique  for  autonomous  choice  of  an 
appropriate  fixation  point. 

10.1  Algorithm 

Similar to chapter 4 (when using $\omega_{R_0} = 0$), the least squares method can be
applied to the BCCE terms to obtain the following system of linear equations for the
uniform motion field (u, v) on a patch P:

\[
\begin{pmatrix}
\iint_P E_x^2 \, dx \, dy & \iint_P E_x E_y \, dx \, dy \\
\iint_P E_x E_y \, dx \, dy & \iint_P E_y^2 \, dx \, dy
\end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
=
\begin{pmatrix}
-\iint_P E_t E_x \, dx \, dy \\
-\iint_P E_t E_y \, dx \, dy
\end{pmatrix}
\tag{10.1}
\]

A solution for (u, v) exists (i.e. the motion is detectable) if the determinant of the
above matrix,

\[
D = \iint_P E_x^2 \, dx \, dy \iint_P E_y^2 \, dx \, dy
    - \left( \iint_P E_x E_y \, dx \, dy \right)^2
\tag{10.2}
\]

is not zero. However, this is not a reliable criterion for real images because, due to
noise, we may have $D \neq 0$ without any guarantee that the patch is an appropriate one.

If we denote the smaller eigenvalue of the coefficient matrix in eqn. 10.1 by $\lambda_s$,

\[
\lambda_s = \frac{1}{2} \left( \iint_P (E_x^2 + E_y^2) \, dx \, dy
  - \sqrt{ \left( \iint_P (E_x^2 - E_y^2) \, dx \, dy \right)^2
         + 4 \left( \iint_P E_x E_y \, dx \, dy \right)^2 } \right)
\tag{10.3}
\]

then we can define a good fixation point as a point whose corresponding patch has
the largest $\lambda_s$. Using such a patch not only guarantees a solution ($D \neq 0$) but also
ensures that our solution (u, v) is not sensitive to noise errors in the coefficient matrix
of eqn. 10.1.

The reasoning behind using the largest $\lambda_s$ is the form of the characteristic
polynomial of the coefficient matrix in eqn. 10.1,

\[
F(\lambda) = \lambda^2 - \lambda \iint_P (E_x^2 + E_y^2) \, dx \, dy
  + \iint_P E_x^2 \, dx \, dy \iint_P E_y^2 \, dx \, dy
  - \left( \iint_P E_x E_y \, dx \, dy \right)^2
\tag{10.4}
\]

When $\lambda$ is large, small errors in the coefficients result in a negligible error in
$F(\lambda)$ compared to the case when $\lambda$ is small. This implies that in patches with larger
$\lambda_s$, the apparent motion components (u, v) are less sensitive to small errors in the
coefficients which may occur due to noise.
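
As a rough illustration, the smaller eigenvalue in eqn. 10.3 can be computed for one patch as sketched below. This is a minimal sketch and not the implementation used in this work: the integrals over the patch are replaced by sums over pixels, and the function name `smaller_eigenvalue` and the list-of-lists layout of the gradients are assumptions made here.

```python
import math

def smaller_eigenvalue(Ex, Ey):
    """Smaller eigenvalue of the 2x2 coefficient matrix of eqn. 10.1.

    Ex, Ey: equally sized 2D lists holding the spatial brightness
    gradients over one patch (hypothetical input layout).
    """
    # Discrete counterparts of the three integrals in eqn. 10.1.
    a = sum(ex * ex for row in Ex for ex in row)               # sum of Ex^2
    c = sum(ey * ey for row in Ey for ey in row)               # sum of Ey^2
    b = sum(ex * ey for rx, ry in zip(Ex, Ey)
            for ex, ey in zip(rx, ry))                         # sum of Ex*Ey
    # Closed form of eqn. 10.3 for a symmetric 2x2 matrix.
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4.0 * b * b))
```

A patch with uniform brightness, or one whose gradients all point in the same direction, gives a value of zero, while a patch with gradients in two different directions gives a positive value.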


10.2  Discussion 


It is easy to implement the $\lambda_s$ criterion for autonomous choice of a good fixation
point. This criterion results in reliable choices for the fixation point even in real noisy
images. For patches with relatively uniform brightness, $\lambda_s$ is small, which means that
we should avoid choosing the fixation point in such a patch. We get larger and larger
values of $\lambda_s$ as we choose patches with more features and brightness variations.

We  have  addressed  the  question  of  finding  an  appropriate  fixation  point  (the  center 
of  a  fixation  patch)  among  a  number  of  given  patches.  But  which  patches  should  we 
check  in  the  first  place?  We  can  search  the  whole  image  for  a  globally  optimum 
location  of  a  fixation  point  in  the  following  steps: 

•  Step 1: Divide the whole image into 4 quadrants and find the corresponding $\lambda_s$
for each quadrant.

•  Step 2: Use the quadrant with the largest $\lambda_s$ as a new base image.

•  Step 3: Repeat steps 1 & 2 until reaching a quadrant with an acceptable size.

However, performing such a comprehensive search may not always be necessary.
Instead, we can check a limited number of neighboring patches (near the principal
point, for convenience) and choose the center of the one with the largest $\lambda_s$ as the
fixation point.
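
The quadrant search of steps 1 to 3 could be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `patch_eigenvalue` and `quadrant_search`, the `min_size` stopping rule, and the use of pixel sums in place of the integrals are all choices made here, not details taken from this work.

```python
import math

def patch_eigenvalue(Ex, Ey, r0, r1, c0, c1):
    """Smaller eigenvalue (eqn. 10.3) over rows r0..r1-1 and columns c0..c1-1."""
    a = b = c = 0.0
    for i in range(r0, r1):
        for j in range(c0, c1):
            a += Ex[i][j] ** 2
            b += Ex[i][j] * Ey[i][j]
            c += Ey[i][j] ** 2
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4.0 * b * b))

def quadrant_search(Ex, Ey, min_size=2):
    """Repeatedly keep the quadrant with the largest smaller eigenvalue.

    Returns (center_row, center_col, region), where region is the
    half-open rectangle (r0, r1, c0, c1) of the final quadrant.
    min_size (>= 1) is a hypothetical "acceptable size" threshold.
    """
    r0, r1, c0, c1 = 0, len(Ex), 0, len(Ex[0])
    while (r1 - r0) > min_size and (c1 - c0) > min_size:
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        quadrants = [(r0, rm, c0, cm), (r0, rm, cm, c1),
                     (rm, r1, c0, cm), (rm, r1, cm, c1)]
        # Steps 1 and 2: evaluate each quadrant and keep the best one.
        r0, r1, c0, c1 = max(
            quadrants, key=lambda q: patch_eigenvalue(Ex, Ey, *q))
    return (r0 + r1) // 2, (c0 + c1) // 2, (r0, r1, c0, c1)
```

Because only one quadrant is followed at each level, the search is cheap, but it can miss a good patch that straddles a quadrant boundary.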


Chapter 11

Tracking without Moving the
Camera


The  fixation  method  requires  a  sequence  of  fixated  (tracked)  images  as  its  input. 
However,  in  general  the  acquired  image  sequences  may  not  be  fixated  at  any  point 
and  even  if  they  are  it  is  not  easy  to  find  that  fixation  point. 

Our  fixation  method  does  not  depend  on  how  the  fixated  images  are  obtained. 
But  along  the  course  of  this  thesis  work,  we  were  confronted  with  the  challenge  of 
constructing  a  sequence  of  fixated  (tracked)  images  from  an  arbitrary  image  sequence. 

This chapter describes the experimental results and the implementation issues
involved in constructing sequences of fixated images from several real image sequences.

11.1  Background 

The  task  of  constructing  a  sequence  of  fixated  images  is,  in  essence,  the  well  known 
tracking  problem.  People  have  been  working  on  different  aspects  of  this  problem  using 
various  techniques  for  many  years  [43,  22,  53].  For  example,  Aloimonos  &  Tsakiris 
[5]  propose  a  method  for  tracking  a  foveated  target  of  known  shape;  Bandopadhay  et 
al.  [10]  use  optical  flow  and  feature  correspondence  for  tracking  the  principal  point 
in  order  to  find  the  motion  in  a  special  case  (they  assume  that  there  is  no  rotation 
along the optical axis) without considering noise; and Sandini & Tistarelli [52] use
an optical flow based tracking method for finding the depth in a special case (no
rotation along the optical axis). All these methods use optical flow and/or feature
correspondence and address only special cases. There has also been some work on
using visual tracking for finding the trajectory of an object moving in an environment
[15, 90].

Traditionally,  tracking  has  been  associated  with  mechanically  moving  the  camera 
to keep the image of a particular point stationary at the image center. Some
techniques even rely on such a system. For example, Thompson [74] introduces an optical
flow method for recovering the motion in the special case where the rotational velocity
along the optical axis is zero. His method requires a sequence of images tracked at the
principal point, but he acknowledges that the actual implementation of such a tracking
requirement in engineering systems is not possible yet.

Hardware  tracking  is  done  by  physically  moving  the  camera  with  respect  to  the 
environment.  Considering  that  in  general  the  point  of  interest  has  a  motion  relative 
to the observer, the 2nd fixated image cannot be obtained in one step. As a result,
a feedback control loop is required for the camera rotation system to compensate for
the errors resulting from the new position of the fixation point [46, 20, 24, 41, 89,
19]. These difficulties and other problems such as expense, real time response, and
potential  errors  involved  make  mechanical  tracking  unattractive  especially  for  our 
vision  system. 


11.2  Pixel  Shifting  Process 

Here, we use the pixel shifting process described in chapter 5 for constructing a
sequence of fixated images from an arbitrary image sequence. This method solves the
tracking problem in its most challenging case. In other words, it does not require
any knowledge about the motion or shape. Furthermore, the fixation point is not
restricted to the principal point (image center) and virtually any point can be chosen
as the fixation point. The pixel shifting process is done purely in software without any
need to mechanically move the camera for tracking. It is computationally simple and
uses neither optical flow nor feature correspondence. Instead, brightness gradients of
the initial input images are used directly.


11.2.1  Bilinear  Interpolation 

We showed that constructing a fixated image is the same as finding the brightness E
for every pixel (x, y) of such an image (see chapter 5). We proved that the brightness
E at pixel (x, y) of the 2nd fixated image is the same as the brightness at the point
(x - Tu, y - Tv) of the 2nd initial image, where the shifting vector (u, v) is given by
eqn. 5.4 and T is the time interval between the two initial images.

In practice, the point (x - Tu, y - Tv) does not exactly coincide with any pixel.
Instead, it is usually surrounded by four pixels whose brightnesses may be denoted by
$E_{i,j}$, $E_{i,j+1}$, $E_{i+1,j}$, and $E_{i+1,j+1}$ (fig. 11-1). In this figure, p and q are the
horizontal and vertical distances of the mapped point from pixel (i, j). Considering that
this can happen for any pixel, the average
$\frac{1}{4}(E_{i,j} + E_{i,j+1} + E_{i+1,j} + E_{i+1,j+1})$ is not a good
estimate for E because it corrupts the constructed image by introducing aliasing.

Figure 11-1: The mapped point in the 2nd initial image does not usually coincide
with any single pixel. Instead it is usually surrounded by four pixels.

Bilinear interpolation of the surrounding brightness levels has proven to be a very
good estimate for E, which is given as

\[
E = (1 - p)(1 - q) E_{i,j} + p (1 - q) E_{i,j+1} + q (1 - p) E_{i+1,j} + p q E_{i+1,j+1}.
\tag{11.1}
\]

As shown in fig. 11-1, p and q represent the horizontal and vertical distances of the
mapped point from pixel (i, j). Such an algorithm gives the largest weight to the
pixel closest to the mapped point and results in the exact brightness value when the
mapped point coincides with a pixel, p = q = 0.

All the constructed images in this work are obtained using bilinear interpolation.
Our experimental results have shown that such interpolation is quite satisfactory.
There are other techniques, such as bicubic interpolation [1, 13, 32, 49, 50], which
are much more expensive; we did not find it necessary to use them in this work.
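
The bilinear interpolation of eqn. 11.1 can be sketched as below. This is a minimal illustration rather than the code used in this work; the function name `bilinear`, the row-major list-of-lists image layout (row index i vertical, column index j horizontal), and the requirement that the point lies strictly inside the last row and column are assumptions made here.

```python
import math

def bilinear(E, x, y):
    """Brightness at the non-integer point (x, y) following eqn. 11.1.

    E[i][j] is the brightness at row i (vertical) and column j
    (horizontal). Assumes 0 <= x < width - 1 and 0 <= y < height - 1.
    """
    i, j = int(math.floor(y)), int(math.floor(x))
    p, q = x - j, y - i          # horizontal and vertical distances
    return ((1 - p) * (1 - q) * E[i][j]
            + p * (1 - q) * E[i][j + 1]
            + q * (1 - p) * E[i + 1][j]
            + p * q * E[i + 1][j + 1])
```

When the point coincides with pixel (i, j), p = q = 0 and the function returns the brightness of that pixel unchanged.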


11.3  Construction  of  Fixated  Images 

The  landscape  and  cup  image  sequences  in  figures  7-1  and  7-4  are  used  as  input 
(initial)  images  in  the  following  experiments.  As  we  discussed  earlier,  the  1st  initial 
images (top images) in those figures are directly used as the 1st fixated images. Then
the  pixel  shifting  process  and  the  bilinear  interpolation  are  applied  to  the  2nd  initial 
images  (bottom  images  in  figures  7-1  and  7-4)  to  construct  the  2nd  fixated  images, 
figures  11-2  and  11-3.  These  constructed  images  are  quite  good  and  look  as  natural 
and  crisp  as  the  original  images  do.  We  will  describe  the  quality  of  these  images 
further  in  the  following  sections. 

Figure 11-2: The constructed 2nd fixated image for the landscape image sequence.

Depending on the size and direction of the equivalent rotational velocity (see
chapter 5), the brightness E at some border pixels is not computable because these
pixels are mapped to points outside the domain of the initial images. The brightness
at such bordering pixels is given an arbitrary value of 0, which causes the appearance
of bold black lines at the border of the constructed images. This should not concern us
because in general the results near the image borders are not considered reliable
anyway.
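
Putting the pieces together, the construction of a 2nd fixated image could be sketched as follows. This is a hedged sketch, not the implementation used in this work: the shifting vector of eqn. 5.4 is not reproduced here, so it is passed in as a hypothetical caller-supplied function `shift(x, y)` returning (u, v), and the names `construct_fixated` and `T` as a keyword argument are assumptions.

```python
def construct_fixated(E2, shift, T=1.0):
    """Build the 2nd fixated image from the 2nd initial image E2.

    E2: 2D list of brightness values (at least 2 x 2 pixels).
    shift: hypothetical function (x, y) -> (u, v), standing in for eqn. 5.4.
    Pixels mapped outside the initial image are set to 0.
    """
    h, w = len(E2), len(E2[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = shift(x, y)
            xs, ys = x - T * u, y - T * v          # mapped point
            if not (0 <= xs <= w - 1 and 0 <= ys <= h - 1):
                continue                            # border pixel stays 0
            # Clamp so the four neighbours always exist.
            j, i = min(int(xs), w - 2), min(int(ys), h - 2)
            p, q = xs - j, ys - i
            out[y][x] = ((1 - p) * (1 - q) * E2[i][j]
                         + p * (1 - q) * E2[i][j + 1]
                         + q * (1 - p) * E2[i + 1][j]
                         + p * q * E2[i + 1][j + 1])
    return out
```

With a zero shift the function returns a copy of the input image, which is a quick sanity check for an implementation of this kind.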


11.4  Spatial  and  Temporal  Gradient  Maps 


The  gradient  maps  are  good  measures  for  studying  the  quality  and  characteristics  of 
fixated  image  sequences.  This  section  examines  the  gradient  maps  of  two  different 
fixated image sequences that we have constructed from real image sequences.

Figure 11-3: The constructed 2nd fixated image for the cup image sequence.

11.4.1  Landscape  fixated  image  sequence 

The combination of the 1st initial image (top image in fig. 7-1) and the 2nd fixated
image in fig. 11-2 forms the landscape fixated image sequence. The corresponding
spatial gradient maps in fig. 11-4 show that these gradients contain valuable information.
The vertical and horizontal features of the initial images are indirectly represented in
the spatial gradients.

The  temporal  gradient  map  of  the  landscape  fixated  image  sequence  is  shown  in 
fig.  11-5.  This  map  contains  very  important  information.  First  of  all  it  clearly  shows 
the  characteristic  of  a  fixated  image  sequence.  It  is  clear  that  both  the  horizontal 
and  vertical  features  of  the  image  sequence  become  more  obvious  as  their  distance 
from the fixation point location (image center in this case) increases. Secondly, the
appearance of the horizontal and vertical lines here provides hints about the existence
of a rotational component about the fixation axis. And finally, the dominant vertical
lines are an indication that the equivalent rotational velocity has a major component
about the vertical axis.


11.4.2  Cup  fixated  image  sequence 

The fixated cup image sequence consists of the top image in fig. 7-4 (as the 1st
fixated image) and the 2nd fixated image in fig. 11-3. Figure 11-6 shows the spatial
gradient  maps  for  this  image  sequence.  The  horizontal  gradient  map  (top)  identifies 
the  vertical  edge-like  features  and  the  vertical  gradient  map  (bottom)  detects  the 
horizontal  edge-like  features  in  the  image.  We  should  emphasize  here  that  we  neither 
intended  to  find  edges  nor  have  we  used  those.  However,  it  is  important  to  observe 
that  spatial  gradients  (simple  horizontal  and  vertical  differences)  of  fixated  images 
indirectly  capture  important  features  of  the  images. 
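
For illustration, spatial and temporal gradients of a two-frame sequence can be estimated by simple differences as sketched below. The exact difference scheme is an assumption made here: the sketch averages first differences over a 2 x 2 x 2 cube of brightness values, which is one common choice, and the function name `gradients` is hypothetical.

```python
def gradients(E1, E2):
    """Estimate (Ex, Ey, Et) from two equally sized frames E1 and E2.

    Each estimate averages four first differences over a 2 x 2 x 2
    cube, so the outputs are one row and one column smaller than the
    inputs.
    """
    h, w = len(E1) - 1, len(E1[0]) - 1
    Ex = [[0.0] * w for _ in range(h)]
    Ey = [[0.0] * w for _ in range(h)]
    Et = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Horizontal differences, averaged over two rows and two frames.
            Ex[i][j] = 0.25 * sum(F[i + d][j + 1] - F[i + d][j]
                                  for F in (E1, E2) for d in (0, 1))
            # Vertical differences, averaged over two columns and two frames.
            Ey[i][j] = 0.25 * sum(F[i + 1][j + d] - F[i][j + d]
                                  for F in (E1, E2) for d in (0, 1))
            # Temporal differences, averaged over the 2 x 2 spatial block.
            Et[i][j] = 0.25 * sum(E2[i + a][j + b] - E1[i + a][j + b]
                                  for a in (0, 1) for b in (0, 1))
    return Ex, Ey, Et
```

Displaying each returned array as a grey-level image gives gradient maps of the kind discussed in this section.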

Figure 11-7 represents the temporal gradient map of the fixated cup image
sequence. This map is dominated by vertical lines, which indicates that the rotational
component about the fixation axis is negligible and the equivalent rotational
velocity has only a component about the vertical axis. Furthermore, these vertical lines
become more evident as their distance from the image center increases, which is an
indication that the fixation point is located near the image center.

11.5  Summary 

The  experimental  results  in  this  chapter  show  that  the  pixel  shifting  process  can  be 
easily  used  for  constructing  a  sequence  of  images  fixated  at  any  arbitrary  point.  This 
software  based  technique  is  computationally  simple  and  does  not  require  moving  the 
camera  for  tracking  the  desired  fixation  point. 

The  novel  representation  of  the  spatio-temporal  gradients  by  their  corresponding 
maps  showed  that  gradients  not  only  preserve  the  image  features  but  also  capture  the 
motion in a unique way which reflects the characteristics of fixated image sequences.

Figure 11-4: The spatial gradient maps of the fixated landscape image sequence in
x direction (top) and y direction (bottom).

Figure 11-5: The temporal gradient map of the fixated landscape image sequence.

Figure 11-6: The spatial gradient maps of the fixated cup image sequence in
x direction (top) and y direction (bottom).

Figure 11-7: The temporal gradient map for the fixated cup image sequence.


Chapter 12

Depth Map Recovery


This  chapter  describes  how  depth  maps  are  recovered  from  real  image  sequences. 
It  also  describes  implementation  issues  and  the  techniques  used  in  the  recovery  of 
depth  maps. 

12.1  Introduction 


Earlier in chapter 3, we proved that ideally the depth at any point of a fixated image
is given by eqn. 3.35,

\[
Z = \frac{\mathbf{s} \cdot \mathbf{t}}
         {\dfrac{(\mathbf{v} \times \hat{\mathbf{R}}_0) \cdot \mathbf{t}}{\|\mathbf{R}_0\|}
          - E_t - \omega_{R_0} \, \mathbf{v} \cdot \hat{\mathbf{R}}_0}
\tag{12.1}
\]

where $\hat{\mathbf{R}}_0$ is the unit vector along the fixation axis and s and v are the known
vector functions of pixel position (x, y) and spatial gradients ($E_x$, $E_y$) as given in
equations 2.9 and 2.10.

The  translational  velocity  t  is  obtained  by  finding  the  eigenvector  corresponding 
to  the  smallest  eigenvalue  of  matrix  M  in  eqn.  3.31.  The  optimal  patch  size  found  in 
chapter  9  is  used  for  the  estimation  of  t. 

All the computations in this chapter are performed using the data from the fixated
image sequences that we constructed in chapter 11.

12.2  Detecting the Depth Flaws

It  is  well  known  that  depth  recovery  from  real  images  is  not  perfect  because  of  noise 
and  other  characteristics  of  real  images.  This  section  describes  the  techniques  for 
detecting  pixels  where  depths  are  not  acceptable. 

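Using the notations Num and Denom as

\[
Num = (\mathbf{s} \cdot \mathbf{t}) (\mathbf{s} \cdot \mathbf{t})
\tag{12.2}
\]

and

\[
Denom = (\mathbf{s} \cdot \mathbf{t}) \left(
  \frac{(\mathbf{v} \times \hat{\mathbf{R}}_0) \cdot \mathbf{t}}{\|\mathbf{R}_0\|}
  - E_t - \omega_{R_0} \, \mathbf{v} \cdot \hat{\mathbf{R}}_0 \right)
\tag{12.3}
\]

equation 12.1 can be written as

\[
Z = \frac{Num}{Denom}
\tag{12.4}
\]

Using this equation, we can compute the depth Z at any single pixel in the image.
However, the recovered depth is not always reliable. We call a depth Z unacceptable if it
falls into either of the following cases.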

•  Case  1:  Denom  is  negative. 

This  condition  results  in  a  negative  depth  which  should  not  happen  in  our  vision 
system.  This  usually  happens  where  the  data  is  noisy. 

•  Case  2:  Denom  is  zero. 

This case results in an irrecoverable depth (Z = 0/0) or a wrong depth (Z = ∞).
It may occur for many reasons, such as zero translational velocity, a pixel lying in
a patch with uniform brightness (zero gradients), or apparent motion in a direction
perpendicular to the spatial gradients.
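
The two cases above translate into a simple per-pixel test, sketched below. This is an illustrative sketch only: Num and Denom are assumed to be precomputed per pixel, the name `depth_with_flaws` is hypothetical, and the small tolerance `eps` used to treat near-zero denominators as zero is an assumption not taken from this work.

```python
def depth_with_flaws(num, denom, eps=1e-12):
    """Per-pixel depth Z = Num / Denom together with a flaw map.

    num, denom: equally sized 2D lists. A pixel is flagged when Denom
    is negative (case 1) or zero within eps (case 2); its depth is
    then stored as None.
    """
    depth, flaw = [], []
    for num_row, denom_row in zip(num, denom):
        z_row, f_row = [], []
        for n, d in zip(num_row, denom_row):
            bad = d <= eps                 # covers both cases
            z_row.append(None if bad else n / d)
            f_row.append(bad)
        depth.append(z_row)
        flaw.append(f_row)
    return depth, flaw
```

The returned flaw map plays the role of the depth flaw maps shown in this chapter, with flagged pixels corresponding to the black points.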

Figure  12-1  shows  the  depth  flaw  map  for  the  fixated  cup  image  sequence  obtained 
by  using  the  above  criteria  for  detecting  the  points  with  unacceptable  depth.  Any 
black point in this map represents a pixel whose computed depth is not acceptable.
It is quite obvious from this figure that if we compute the depth using only the data
from a single pixel, then we will end up with a considerable number of pixels where
depths are not acceptable.

Figure 12-1: The flaws in the depth map for the fixated cup image sequence. The
pixels with unacceptable depth are shown in black.


12.3  Constructing  a  Primary  Depth  Map 

Figure  12-2  shows  the  depth  map  where  each  depth  value  is  computed  using  only  the 
data from its corresponding pixel. Using such a method leaves us with many pixels of
unacceptable  depths  which  are  left  blank  (white)  in  this  depth  map. 

This  is  a  primary  depth  map  and  obviously  is  not  very  informative  because  depth 
information  is  missing  in  many  areas.  In  the  next  section  the  first  effort  is  made  for 
estimating the depth at such points.


Figure  12-2:  The  initial  depth  map  for  the  fixated  cup  image  sequence.  The  areas 
close  to  the  viewer  are  bright  and  the  pixels  whose  depths  are  not  acceptable  are  left 
blank  (white). 


12.3.1  Filling  in  the  Missing  Depths 


At any pixel where the depth information is missing (depth is unacceptable), we can
find a depth estimate by averaging the reliable depths at its surrounding pixels. The
notation $r_f$ is used for the radius of such a patch. This radius is defined in a way
that forms a square patch whose side has a length of (2 x $r_f$ + 1) pixels. Figure 12-3
shows the corresponding completed depth map. A maximum patch size of radius
$r_f$ = 6 pixels has been used for finding an estimate for the points where depths were
not known in the initial depth map, fig. 12-2. Although this primary depth map is not
perfect, it delivers very useful clues about the boundary of objects in the environment
(books, cup, and spoon).
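
The filling step can be sketched as below. This is a minimal sketch under assumptions made here: missing depths are marked with None, the patch radius is grown one pixel at a time until at least one reliable depth is found or the maximum radius is reached, and the names `fill_missing` and `r_max` are hypothetical.

```python
def fill_missing(depth, r_max):
    """Fill missing depths (None) by averaging reliable neighbours.

    For each missing pixel, the square patch of side (2 * r + 1) is
    grown from r = 1 up to r_max until it contains a reliable depth.
    Only depths from the initial map are averaged, never filled ones.
    """
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            if depth[y][x] is not None:
                continue
            for r in range(1, r_max + 1):
                vals = [depth[i][j]
                        for i in range(max(0, y - r), min(h, y + r + 1))
                        for j in range(max(0, x - r), min(w, x + r + 1))
                        if depth[i][j] is not None]
                if vals:
                    out[y][x] = sum(vals) / len(vals)
                    break
    return out
```

A pixel with no reliable depth within the maximum radius stays unfilled, so a larger maximum radius is needed wherever the flawed regions are wide.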


Figure 12-3: The completed depth map for the fixated cup image sequence with
$r_f$ = 6 pixels. The areas close to the viewer look brighter.

12.4  Improving  the  Depth  Map 

We  can  considerably  improve  the  depth  map  by  using  the  data  from  a  surrounding 
patch for computing the depth at any pixel point. We denote the radius of such a patch
by $r_p$. Similar to $r_f$, the radius $r_p$ is defined in a way that forms a square patch whose
side has a length of (2 x $r_p$ + 1) pixels.

Applying such a simple technique decreases the number of depth flaws and
increases the quality of the depth map considerably. Figure 12-4 shows the results when a
patch of 1 pixel in radius is used for depth computation at any pixel ($r_p$ = 1 pixel).
Although the depth flaws (at the top of fig. 12-4) have not disappeared, they have
shrunk noticeably when compared to the previous case.

The initial depth map is shown in the middle of fig. 12-4, where the pixels with
unreliable depth estimates are left blank (white). The completed depth map is given
at the bottom of fig. 12-4, where a patch of maximum 9 pixels in radius ($r_f$ = 9 pixels)
is used for finding depth estimates at points where depths were not known in the initial
depth map. The shapes of the objects in the image have started to become identifiable
in this completed depth map.

12.5  Even  Better  Depth  Maps 

The depth maps can be further improved by using larger patches for depth estimation
at any single pixel. Figures 12-5 through 12-7 show the depth flaw, initial depth, and
completed depth maps for cases with patch sizes of radius $r_p$ = 2, 3, and 4 pixels. The
maximum patch radii for completing the depth maps were $r_f$ = 11, 15,
and 17 pixels, respectively. These maps show that the objects in the environment (books,
spoon, cup, and even the background poster) become more identifiable and smoother.

The experimental results show that if a relatively large initial patch size $r_p$ is used,
then the depth map may lose some of its fine details.

12.6  Subsampling  the  Fixated  Images 

In this section, we have subsampled each of the fixated images by a factor of 2 before
using them for depth recovery. This is done by substituting a patch of 2 x 2 neighboring
pixels with a new pixel whose brightness is the average of the 4 initial pixels. This is the
smallest symmetric subsampling which can be done on an image. We expect to gain
a better depth map because subsampling usually leads to a decrease in noise.
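
The subsampling described above amounts to the following sketch. The function name `subsample2` is hypothetical, and dropping a trailing odd row or column is an assumption made here for simplicity.

```python
def subsample2(E):
    """Subsample an image by a factor of 2 using 2 x 2 block averages.

    E: 2D list of brightness values. A trailing odd row or column is
    dropped (an assumption of this sketch).
    """
    h, w = len(E) // 2, len(E[0]) // 2
    return [[(E[2 * i][2 * j] + E[2 * i][2 * j + 1]
              + E[2 * i + 1][2 * j] + E[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)]
            for i in range(h)]
```

Averaging four pixels reduces uncorrelated pixel noise, which is the reason a cleaner depth map is expected from the subsampled images.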

The depth flaw (top), initial depth (middle), and completed depth (bottom) maps
for the subsampled image sequence with $r_p$ = 0 are shown in fig. 12-8. These maps
indicate that some improvements are made by subsampling. This becomes clear if
we notice that in the depth flaw map (top of fig. 12-8) there are fewer regions with
unacceptable depths than in the corresponding depth flaw map obtained from images
which were not subsampled (fig. 12-1). The initial depth map (middle) is not very
informative here. As before, the pixels with unacceptable depths are left blank (white)
in the initial depth map. A patch of maximum 4 pixels in radius ($r_f$ = 4 pixels) is
used for completing the initial depth map. Even this completed depth map (bottom
of fig. 12-8) offers only a very vague intuition about the boundaries of the objects in
the image.

In the next step, we have used a patch of 1 pixel in radius ($r_p$ = 1) for the depth
estimation at any single pixel. The results are shown in fig. 12-9. As expected, the
depth flaws have not fully disappeared (top). These points are left blank (white) in
the initial depth map (middle). For obtaining the completed depth map (bottom), a
patch of maximum 6 pixels in radius ($r_f$ = 6) is used in this case. Considering the
subsampling size of 2 x 2 pixels, these results lie somewhere between the
results of the nonsampled images with $r_p$ = 2 and $r_p$ = 3 (figures 12-5 and 12-6).

Figure 12-10 shows the results for the subsampled images for the case with $r_p$ =
2 pixels and $r_f$ = 9 pixels.

A  careful  observation  shows  that  there  are  not  many  differences  between  sampled 
and  nonsampled  results  from  the  point  of  view  of  identifying  different  objects  in  the 
environment.  However,  the  depth  maps  of  subsampled  images  have  much  better 
quality  and  are  relatively  free  from  the  systematic  noise.  This  is  quite  clear  if  we 
notice  that  the  vertical  black  lines  between  the  books  which  were  seen  in  previous 
depth  maps  are  absent  here.  These  lines  represent  narrow  but  deep  vertical  gaps 
between  the  books  which  did  not  actually  exist  in  the  environment. 

Furthermore, due to the grey level limitation of the printer, quality depth maps cannot
be printed out. The computed depth maps are much better than what is shown here.
For example, each book has its own relatively uniform depth, which clearly distinguishes
it from its neighboring books when there is a depth change in the real environment.


12.7  Summary 


This chapter combined the individual results that we had obtained in previous
chapters and used them in the recovery of depth maps. The recovered depth maps are
quite good considering that the input to the system was only two unrestricted frames.
These images were real and noisy. Furthermore, the motion was not known in
advance, and the recovered motion was used in the computations. It is also important
to notice that only simple computations have been involved in all the steps.

The  experimental  results  show  that  by  subsampling  the  initial  images,  much  better 
depth  maps  are  obtained.  This  is  due  to  the  fact  that  subsampling  acts  as  a  low  pass 
filter  and  eliminates  the  high  frequency  noise  which  is  inherent  in  real  images. 

An overall study of the experimental results in this chapter shows that
$r_p$ = 2 or 3 pixels seems to be a good choice. This is probably
because a mask of 2 x 2 pixels is used for the computation of gradients.
As a result, using a smaller $r_p$ will not give a good depth map. On the other hand,
using a larger $r_p$ may eliminate some fine details of the depth map
without improving its overall quality.

It should also be pointed out that we do not have any control over the choice of $r_f$.
The algorithm automatically chooses an $r_f$ large enough to include pixels with reliable
depths in order to find estimates for depths at pixels where depths were missing in
the initial depth map.

All the results in this chapter were constructed by using a single $r_f$ for obtaining
the depth estimate at any pixel point with an unacceptable depth value. An adaptive
approach which chooses $r_p$ appropriately at any desired pixel point will result in
smoother depth maps.
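
The completion step with a single radius can be sketched as follows in Python. This is
only an illustration under stated assumptions, not the implementation used for the
experiments: missing depths are marked with `None`, the neighborhood is a square window,
a missing depth is replaced by the mean of the reliable depths in its window, and the
names `reliable_in_window` and `complete_depth_map` are hypothetical.

```python
def reliable_in_window(depth, i, j, radius):
    """Collect the reliable (non-None) depths within `radius` pixels of (i, j)."""
    rows, cols = len(depth), len(depth[0])
    return [
        depth[a][b]
        for a in range(max(0, i - radius), min(rows, i + radius + 1))
        for b in range(max(0, j - radius), min(cols, j + radius + 1))
        if depth[a][b] is not None
    ]


def complete_depth_map(depth):
    """Fill missing depths using one radius r_f shared by all pixels.

    Assumptions (not from the thesis): missing depths are marked None, the
    window is square, and a missing depth is replaced by the mean of the
    reliable depths in its window.
    """
    missing = [(i, j) for i, row in enumerate(depth)
               for j, z in enumerate(row) if z is None]
    if not any(z is not None for row in depth for z in row):
        raise ValueError("no reliable depths to fill from")
    r_f = 0
    # Grow r_f until every missing pixel sees at least one reliable depth.
    while any(not reliable_in_window(depth, i, j, r_f) for i, j in missing):
        r_f += 1
    completed = [row[:] for row in depth]
    for i, j in missing:
        values = reliable_in_window(depth, i, j, r_f)
        completed[i][j] = sum(values) / len(values)
    return completed, r_f


initial = [
    [2.0, None, None, 4.0],
    [2.0, None, None, 4.0],
]
completed, r_f = complete_depth_map(initial)
```

An adaptive variant would grow the radius separately for each missing pixel instead of
sharing one value over the whole map.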


Figure 12-4: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p = 1$ pixel, and $r_f = 9$ pixels.
The areas close to the viewer look brighter.

Figure 12-5: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p = 2$ pixels, and $r_f = 11$ pixels.
The areas close to the viewer are shown brighter.

Figure 12-6: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p = 3$ pixels, and $r_f = 15$ pixels.
The areas close to the viewer look brighter.

Figure 12-7: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the fixated cup image sequence with $r_p = 4$ pixels, and $r_f = 17$ pixels.
The areas close to the viewer look brighter.

Figure 12-8: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the subsampled (by 2) fixated cup image sequence with $r_p = 0$, and
$r_f = 4$ pixels. The areas close to the viewer look brighter.

Figure 12-9: The depth flaw (top), initial depth (middle), and completed depth (bottom)
maps for the subsampled (by 2) fixated cup image sequence with $r_p = 1$, and
$r_f = 6$ pixels. The areas close to the viewer look brighter.

Figure 12-10: The depth flaw (top), initial depth (middle), and completed depth
(bottom) maps for the subsampled (by 2) fixated cup image sequence with $r_p = 2$,
and $r_f = 9$ pixels. The areas close to the viewer look brighter.


Chapter 13


Camera calibration is an important area of research involving the study of techniques
for obtaining reliable estimates of the required internal and external parameters
of a camera in a vision system.

For many years, computer vision scientists have been working on different aspects
of the camera calibration problem, such as focal length (principal distance) [77, 86, 87],
principal point (image center) [33, 86], scale factor (difference between the scanning
frequency of the camera sensor plane and the scanning frequency of the image capturing
board frame buffer) [33, 47], intrinsic parameters (camera internal geometric
and optical characteristics) [77], extrinsic parameters (the 3D position and orientation
of the camera coordinate system relative to a certain world coordinate system)
[77, 85, 87, 18, 16, 86], and the hand-eye transform system (the 3D position and
orientation of a camera relative to the last joint of a robot manipulator in an eye-on-hand
configuration) [78, 79, 12].

In the previous chapters we saw that some parameters, such as the focal length and
the principal point, play an important role in the formulations. Manufacturers usually give
a nominal value for the focal length, but this nominal value is not always sufficiently
accurate to be used in the computations. Some other important parameters, such as
the true principal point, are not given at all.

In this chapter, some of the calibration techniques used in this work will be
described.

13.1 Principal Point Calibration

The principal point is where the optical axis intersects the image plane; see fig. 2-1.
Ideally, the principal point is located at the center of the image plane. However, in
off-the-shelf cameras the principal point is not necessarily located at the center of the
image plane. Finding the true location of the principal point is important because
its coordinates appear in our algorithms.

For  the  cup  images  the  nominal  image  center  was  used  as  the  principal  point 
because  the  camera  was  not  accessible  to  be  calibrated.  On  the  other  hand,  in  the 
case  of  the  landscape  images  the  true  principal  point  was  obtained  using  a  direct 
optical  method  [33]. 

The  experimental  results  showed  that  the  true  principal  point  was  considerably 
off  from  the  nominal  image  center.  It  was  located  at  about  13  pixels  to  the  left  and 
13  pixels  below  the  nominal  image  center. 

13.1.1  Direct  optical  method 

The  direct  optical  method  is  a  very  simple  and  accurate  calibration  technique  for 
finding  the  principal  point.  This  method  requires  only  a  laser.  The  lens  assembly  is 
used  as  a  reflecting  surface  and  therefore,  the  lens  can  remain  mounted  on  the  camera. 

When  a  laser  beam  is  pointed  at  a  lens  assembly,  part  of  the  light  is  reflected 
when  the  beam  enters  the  glass  and  also  when  it  leaves  it.  Multiple  reflections  occur 
when  the  beam  is  reflected  within  the  lens  and  can  be  observed  on  a  piece  of  paper 
attached  to  the  front  of  the  laser  with  a  small  hole  for  the  primary  beam.  With  some 
experimental  skill  the  laser  can  be  adjusted  relative  to  the  lens  so  that  all  reflections 
coincide  with  the  primary  beam,  indicating  that  it  is  aligned  with  the  optical  axis. 
Once  aligned,  an  attenuation  filter  is  placed  in  the  optical  path,  the  camera  is  turned 
on  and  the  center  of  the  light  spot  observed  can  be  used  as  the  image  center. 

This method is commonly used in experimental optics to align lens assemblies and
gives reproducible results. If the lens is removed, the reflection from the surface of
the image sensor will also give an indication of its perpendicularity with respect to
the optical axis. When a low power laser (< 10 mW) is used, no harm is done to a
discrete array camera sensor (CCD). However, vidicon tubes might be damaged by
burning in.

13.2  Calibration  of  the  Rotation  Axis 

In the landscape experiments, we did not explicitly apply any vertical translation
(along the Y axis). However, fig. 8-2 shows a considerable vertical translation of about
-0.9 mm. This is mainly because the real rotation axis does not pass through the
center of projection.¹

To clarify this, we should mention that in motion vision it is assumed that the
rotation axis passes through the origin of the viewer centered coordinate system, i.e.
the center of projection. But at the CMU Imaging Laboratory, the rotation mechanism
was not set up to align the Z axis of rotation with the optical axis. The CMU vision
system was equipped with several cameras and evidently the camera used for taking
the landscape images was set off center. However, for obtaining the experimental
results, we have employed algorithms which erroneously assume that the rotation
axis passes through the center of projection.

¹ If the CCD edges are not accurately aligned with the horizontal and vertical axes of the camera
frame, i.e. the CCD is mounted at an angle with respect to the camera coordinate system, errors
of this kind happen in both the vertical and horizontal directions. But that is not the case here
because the inaccuracy of motion estimation has occurred only in the vertical direction.

Figure 13-1: In motion vision the assumption is that the rotation axis passes through
the center of projection (origin). In the landscape image sequence, the true rotation
axis is parallel to the optical axis but does not pass through the origin. This will result in
a translation which should be compensated for.

According to basic kinematics, the compensating translation which results
from shifting the rotation axis is given by

$$\mathbf{v}_o = -\boldsymbol{\omega} \times \mathbf{B} \qquad (13.1)$$

where $\mathbf{B}$ is a vector extending from a point on the real (desired) rotation axis to a point
on the assumed rotation axis; see fig. 13-1. In our experiment,
$\mathbf{v}_o = -(\omega \hat{\mathbf{z}}) \times (b \hat{\mathbf{x}})$,
where $\mathbf{v}_o = -0.9\,\hat{\mathbf{y}}$ mm and $\omega = -0.3$ degree. As a result, the real rotation axis was
located at about $b = -(-0.9)/((-0.3 \times \pi)/180) = -172$ mm perpendicular distance
from the optical axis in the horizontal plane.
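
The arithmetic above can be checked with a short Python sketch. It assumes, as in the
experiment, a rotation about the Z axis and an offset along the X axis. The function
names `cross` and `axis_offset_mm` are hypothetical and are introduced only for this
illustration.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def axis_offset_mm(v_y_mm, omega_deg):
    """Solve v_o = -(omega z) x (b x) for b, given the y component of v_o."""
    omega = math.radians(omega_deg)
    # -(omega z) x (b x) = -omega * b * y, so v_y = -omega * b.
    return -v_y_mm / omega


# Values reported for the landscape sequence: v_o = -0.9 y mm, omega = -0.3 degree.
b = axis_offset_mm(-0.9, -0.3)
omega = math.radians(-0.3)

# Substitute b back into eqn. 13.1 to recover the compensating translation.
v_o = tuple(-c for c in cross((0.0, 0.0, omega), (b, 0.0, 0.0)))
```

Substituting the computed offset back into eqn. 13.1 reproduces the observed vertical
translation, which confirms the sign of $b$.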


13.2.1  Generalization 

A similar method can be used, in the general case, for calibrating a rotation axis
which is parallel to the optical axis of a camera system.

In  order  to  find  the  real  location  of  the  rotation  axis,  the  following  steps  should 
be  taken: 

• Step 1: Apply a pure rotation about the axis which is supposed to be the optical
axis.

• Step 2: If the rotation $\omega_{R_o}$ is not accurately known, compute it by applying eqn. 4.4
to a relatively large patch around the principal point.

• Step 3: Estimate the apparent motion $(u_o, v_o)$ at the principal point using
eqn. 8.4 or 4.4.

• Step 4: The real location of the rotation axis is given by


(13.2) 


where $Z_o$ is the depth at the principal point, and $f$ is the focal length of the camera.

Point $(b_x, b_y)$ represents the location where the real rotation axis (which is parallel
to the optical axis) intersects the image plane.


13.3  Summary 

Focal length, principal point, and the rotation axis position are the three most important
factors which can affect the computations in our motion vision algorithms.

The  experimental  results  show  that  we  may  be  able  to  get  away  with  using  the 
nominal  focal  length  as  the  focal  length,  and  using  the  image  center  as  the  principal 
point.  However,  we  have  to  calibrate  the  system  for  finding  the  real  rotation  axis  and 
compensate  for  the  resultant  translation  if  the  rotation  axis  does  not  pass  through 
the  projection  center.  The  calibration  technique  introduced  in  this  chapter  offers  an 
easy  and  reliable  solution  to  this  important  problem. 


Chapter 14

Conclusions


This thesis introduced a general motion vision system which takes any sequence
of images as its input and recovers the motion and shape without any need to check,
choose, and adjust parameters. A complete implementation of this motion vision
system has been tested on real images, and the critical issues involved in its
autonomous implementation have been studied. This chapter makes some concluding
remarks about this fixation based motion vision system.

14.1  Features 

•  In  contrast  to  previous  work  done  in  the  area  of  motion  vision,  our  solutions  are 
general  and  do  not  impose  any  severe  restrictions  on  the  motion  or  the  structure  of 
the  environment. 

• The fixation method uses neither optical flow nor feature correspondence.
Instead, it directly employs the image brightness gradients.

• Our motion vision system neither requires tracked images as input nor uses
hardware tracking for obtaining fixated images. Instead, it introduces a pixel shifting
process for constructing fixated image sequences at any arbitrary fixation point. This
process is done entirely in software without moving the camera for tracking.

•  The  fixation  method  does  not  restrict  the  fixation  point  and  virtually  any  point 
can  be  chosen  as  the  fixation  point. 

• The algorithms and formulations presented in the fixation method are simple
and have been successfully implemented on real images.


14.2  Results 

•  Good  estimations  for  motion  parameters  can  be  obtained  using  optimum  patch  sizes 
(see  chapter  8). 

• The novel introduction and use of normalized error has enabled us to find optimum
patch sizes which result in good estimates for motion parameters. This technique
has been applied to many real image sequences (see chapter 9).

•  The  novel  pixel  shifting  process  for  constructing  fixated  (tracked)  images  has 
been  successfully  tested  on  several  real  image  sequences  (see  chapter  11). 

• The experimental results in chapter 12 show that good depth maps can be
obtained using only two monocular real images. If we use the data from a single pixel
for recovering the corresponding depth, the reliable depth map will be sparse. Using
the information from several pixels in a surrounding patch for finding the depth at
its central point results in a relatively dense map of reliable depths. We can obtain
even better results by subsampling the initial images. Subsampling acts as a low pass
filter and overcomes some of the inherent high frequency noise in real images.

•  We  may  get  away  with  using  the  nominal  focal  length  and  principal  point  in  the 
fixation  formulations,  but  we  have  to  make  sure  to  calibrate  the  imaging  system  for 
the  real  rotation  axis.  The  method  described  in  chapter  13  offers  a  simple  solution  to 
this  important  practical  problem  which  can  result  in  considerable  motion  estimation 
errors  if  it  is  not  detected  and  compensated  for. 

• The implementations were done in C on a Sun SPARCstation IPX.
Although the programs were neither parallel nor optimized, the actual run-time for
an image of 227 x 280 pixels was a fraction of a second for finding the motion
parameters and a few seconds for finding the depth map.

14.3 Assumptions

• In the process of solving the general motion vision problem and writing eqn. 3.27,
we assumed that the motion parameters can be obtained using a small patch around
the fixation point. This is a purely geometric assumption and does not place any
restrictions on the depth topology. Numerous experimental results in chapter 9 show
that optimum patch sizes are small enough to justify our assumption.

• This work assumes that there is one rigid motion between the environment
and the observer. However, small deviations from rigidity are tolerated by the system
because they are treated as noise, and the least squares method finds the best solution
which fits the whole data.


14.4  Shortcomings 

•  The  fixation  method  fails  if  the  fixation  point  is  located  at  the  center  of  a  uniform 
brightness  patch  because  in  such  a  case,  motion  will  be  undetectable.  However,  we 
have  presented  a  mechanism  for  preventing  this  from  happening  by  introducing  an 
autonomous  technique  which  chooses  an  appropriate  location  for  the  fixation  point 
(see  chapter  10). 

14.5  Relation  to  Other  Works 

• As opposed to other work done in the area of direct methods, our fixation technique
estimates both the motion and shape for the general case [69, 60].

• In recent years, many Kalman filter based techniques have tried to improve the
depth estimates over time by using more than two frames [38, 39, 40, 56, 57, 58, 59,
25]. These techniques not only need to know the motion in advance but also require
a good initial guess for the depth map in order to converge to a solution. Despite
these major advantages given to the Kalman filter methods, the depth maps recovered by our
fixation method are far superior to those obtained by the Kalman
filtering methods even after several iterations [26, 27].

•  Recently,  Tomasi  and  Kanade  [76,  75]  introduced  a  feature  based  technique  for 
recovering  the  motion  and  shape  from  a  sequence  of  images.  Their  method  is  different 
from  our  work  in  the  following  sense: 

-  It  assumes  orthographic  projection  which  handicaps  the  system  when  dealing 
with  close  by  objects. 

-  It  uses  feature  correspondence. 

-  It  requires  choosing  and  tracking  many  feature  points. 

-  Depth  is  obtained  only  at  the  feature  points. 

-  It  is  computationally  very  expensive. 


14.6  Future  Extensions 

• The motion estimates obtained from the fixation method are quite satisfactory. However,
the depth maps may be improved by using more than two image frames in a Kalman
filter based system as follows:

-  Converting  the  input  images  to  a  sequence  of  fixated  images  at  a  desired  fixation 
point  using  the  pixel  shifting  process. 

- Obtaining the motion estimates from the fixation method if the motion is not known.

-  Using  the  depth  map  estimates  from  the  fixation  method  as  the  initial  guess  for 
the  Kalman  filter  system. 

Employing  such  a  hybrid  system  can  potentially  improve  the  depth  map  and 
accelerate  the  convergence  rate  of  the  Kalman  filter. 

• The algorithms and formulations in the fixation method are very well suited
to parallel implementation. Such an approach would greatly improve the system
performance because most of the operations are simple additions and subtractions
which are done independently but all over the image.

• Due to their parallel nature, the fixation algorithms can be implemented on a
single chip using analog VLSI techniques such as the one by Mead [41]. This seems
to be an attractive approach for task specific applications.

• By using segmentation, this work can be extended to the multiple motion case.


Appendix A

Derivation of Brightness Change
Constraint Equation


The brightness change constraint equation (BCCE) relates the change in the image
brightness at a point $(x, y)$ to the apparent velocity $(u, v)$ of the brightness pattern
at that point in the image. This appendix describes in detail the steps involved in
the derivation of the BCCE [30, 54, 29].

Let $E(x, y, t)$ denote the image brightness at time $t$ at the image point $(x, y)$.
Then, if $u(x, y)$ and $v(x, y)$ are the $x$ and $y$ components of the apparent velocity at
the point, we expect that the brightness will be the same at time $t + \delta t$ at the point
$(x + \delta x, y + \delta y)$, where $\delta x = u\,\delta t$ and $\delta y = v\,\delta t$. In other words,

$$E(x, y, t) = E(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) \qquad (A.1)$$

for a small time interval $\delta t$. The underlying assumption in writing eqn. A.1 is that
the lighting varies slowly in space and time, which is true for many practical applications.

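If brightness varies smoothly with $x$, $y$, and $t$, we can expand the right hand side
of the above equation in a Taylor series to obtain

$$E(x, y, t) = E(x, y, t) + \delta x \frac{\partial E}{\partial x} + \delta y \frac{\partial E}{\partial y} + \delta t \frac{\partial E}{\partial t} + e \qquad (A.2)$$

where $e$ includes second- and higher-order terms in $\delta x$, $\delta y$, and $\delta t$. Canceling $E(x, y, t)$,
dividing through by $\delta t$, and taking the limit as $\delta t \to 0$, we obtain

$$\frac{dx}{dt} \frac{\partial E}{\partial x} + \frac{dy}{dt} \frac{\partial E}{\partial y} + \frac{\partial E}{\partial t} = 0 \qquad (A.3)$$

which is actually just the expansion of the total derivative of $E$ with respect to time
into its partial derivatives, in other words

$$\frac{dE}{dt} = 0. \qquad (A.4)$$

Using the abbreviations

$$x_t = \frac{dx}{dt}, \qquad y_t = \frac{dy}{dt} \qquad (A.5)$$

and

$$E_x = \frac{\partial E}{\partial x}, \qquad E_y = \frac{\partial E}{\partial y}, \qquad E_t = \frac{\partial E}{\partial t} \qquad (A.6)$$

equation A.3 can be written as

$$E_t + x_t E_x + y_t E_y = 0. \qquad (A.7)$$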

The above equation is called the brightness change constraint equation because it
expresses a constraint on the components $x_t$ and $y_t$ of the apparent velocity at a point
$(x, y)$ in the image.

In  appendix  B,  we  will  show  how  the  derivatives  Ex,  Ey,  and  Et  are  estimated  at 
any  image  point. 


Appendix B

Computation of Brightness
Gradients


The  spatial  and  temporal  derivatives  of  the  image  brightnesses  are  the  basic  data 
blocks  in  the  direct  methods.  This  appendix  describes  the  formulations  behind  the 
estimation  of  the  brightness  gradients  in  images  [30,  29]. 

The  spatial  brightness  gradients  Ex,  Ey,  and  temporal  brightness  gradient  Et  are 
computed  simply  by  using  the  first  differences  of  image  brightness  values  on  a  cubic 
grid;  see  fig.  B-1. 

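Using the indices $i$, $j$, and $k$ to represent $x$, $y$, and time $t$ respectively, the estimates
of the spatial gradients $E_x$ and $E_y$ are given by

$$E_x \approx \frac{1}{4}\big((E_{i+1,j,k} + E_{i+1,j,k+1} + E_{i+1,j+1,k} + E_{i+1,j+1,k+1}) - (E_{i,j,k} + E_{i,j,k+1} + E_{i,j+1,k} + E_{i,j+1,k+1})\big) \qquad (B.1)$$

and

$$E_y \approx \frac{1}{4}\big((E_{i,j+1,k} + E_{i,j+1,k+1} + E_{i+1,j+1,k} + E_{i+1,j+1,k+1}) - (E_{i,j,k} + E_{i,j,k+1} + E_{i+1,j,k} + E_{i+1,j,k+1})\big) \qquad (B.2)$$

and the temporal gradient $E_t$ is

$$E_t \approx \frac{1}{4}\big((E_{i,j,k+1} + E_{i,j+1,k+1} + E_{i+1,j,k+1} + E_{i+1,j+1,k+1}) - (E_{i,j,k} + E_{i,j+1,k} + E_{i+1,j,k} + E_{i+1,j+1,k})\big). \qquad (B.3)$$

Figure B-1: The first brightness derivatives required in the direct methods can be
estimated using first differences in a 2 x 2 x 2 cube of brightness values. The estimates
apply to the point where four neighboring pixels in an image meet, and at a time
halfway between two successive images.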


These formulations give the brightness gradients at a point lying between four
neighboring pixels, and between successive images.

Considering the fact that we perform spatial tessellation by using pixels and temporal
tessellation by employing individual time varying frames, the above algorithms
compensate for part of the tessellation errors involved in discrete digitized images.
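
A minimal Python sketch of these first-difference estimates is given below. It assumes
unit pixel spacing and a unit frame interval, stores the brightness as `E[k][i][j]`
(frame, x index, y index), and uses the hypothetical function name `gradients`. The
test pattern is a synthetic linear brightness ramp translating at a known velocity, not
data from the experiments. For such a ramp the first differences are exact, so the
left-hand side of the brightness change constraint equation should vanish.

```python
def gradients(E, i, j, k):
    """First-difference estimates of E_x, E_y, E_t on a 2 x 2 x 2 cube.

    E[k][i][j] is the brightness in frame k at x index i and y index j.
    Unit pixel spacing and unit frame interval are assumed.
    """
    def e(di, dj, dk):
        return E[k + dk][i + di][j + dj]

    # Each gradient averages the four first differences along its own axis.
    Ex = sum(e(1, dj, dk) - e(0, dj, dk) for dj in (0, 1) for dk in (0, 1)) / 4.0
    Ey = sum(e(di, 1, dk) - e(di, 0, dk) for di in (0, 1) for dk in (0, 1)) / 4.0
    Et = sum(e(di, dj, 1) - e(di, dj, 0) for di in (0, 1) for dj in (0, 1)) / 4.0
    return Ex, Ey, Et


# Synthetic test pattern: a linear brightness ramp translating at (u, v)
# pixels per frame, E(x, y, t) = a (x - u t) + b (y - v t).
a, b, u, v = 2.0, -1.0, 0.5, 0.25
frames = [[[a * (x - u * t) + b * (y - v * t) for y in range(3)]
           for x in range(3)] for t in range(2)]

Ex, Ey, Et = gradients(frames, 1, 1, 0)
residual = Et + u * Ex + v * Ey   # left-hand side of the BCCE
```

On real images the residual is not exactly zero, because brightness is not linear and
the frames are noisy.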


Appendix C

Depth at Fixation Point


The results in chapter 3 show that after obtaining the translation $\mathbf{t}$, we need to find
$Z_o$ (the depth at the fixation point) in order to estimate the depth $Z$ at any point $(x, y)$ in
the image plane. This appendix introduces an algorithm for finding the depth $Z_o$.

At the fixation point, eqn. 3.26 is exactly expanded to

+  a;R,Vo  •  Ro  +  (-^ - ~)(So-t)  =  0  (C.l) 


which  is  similar  to  eqn.  3.27.  Theoretically,  all  terms  of  the  eqn.  C.l  vanish  because 
Et  is  zero  at  the  fixation  point,  and  v-r  =  0  applies  to  all  points  including  the  fixation 
point  which  means  Vo  •  Ro  =  =0.  As  a  result,  we  cannot  directly  obtain  the 

depth  Zo  from  eqn.  3.26.  However,  at  any  point  i  around  the  fixation  point,  depth 
Zoi  can  be  obtained  from  eqn.  3.26  as 


^  1  /  X  r„  ^  ^ 

V  ■ 


(C.2) 


By  averaging  N  of  such  neighboring  depths,  we  can  estimate  the  depth  Zo  as 


-J—t .  T  (  ^  ~  **^°^*^^* 

^Ikoll  V^t.llroll  +  •  To); 


(C.3) 


where $\mathbf{s}_i$, $\mathbf{v}_i$, and $E_{t_i}$ are computed for $N$ points around the fixation point. In eqn. C.2,
it is assumed that $Z_{oi} \approx Z_o$, which is valid considering the averaging in eqn. C.3.